Abstract

One of the limiting factors of using support vector machines (SVMs) in large scale applications
is their super-linear computational requirements in terms of the number of training
samples. To address this issue, several approaches that train SVMs on many small chunks 
of large data sets separately have been proposed in the literature. So far, however, almost 
all these approaches have only been empirically investigated. In addition, their motivation 
was always based on computational requirements. In this work, we consider a localized 
SVM approach based upon a partition of the input space. For this local SVM, we derive a 
general oracle inequality. Then we apply this oracle inequality to least squares regression 
using Gaussian kernels and deduce local learning rates that are essentially minimax optimal
under some standard smoothness assumptions on the regression function. This gives
the first motivation for using local SVMs that is not based on computational requirements 
but on theoretical predictions on the generalization performance. We further introduce a 
data-dependent parameter selection method for our local SVM approach and show that 
this method achieves the same learning rates as before. Finally, we present some larger 
scale experiments for our localized SVM showing that it achieves essentially the same test 
performance as a global SVM for a fraction of the computational requirements. In addition, 
it turns out that the computational requirements for the local SVMs are similar to those 
of a vanilla random chunk approach, while the achieved test errors are significantly better. 

Keywords: least squares regression, support vector machines, localization 


1. Introduction 


Based on a training set $D := ((x_1, y_1), \ldots, (x_n, y_n))$ of i.i.d. input/output observations
drawn from an unknown distribution $P$ on $X \times Y$, where $X \subset \mathbb{R}^d$ and $Y \subset \mathbb{R}$, the goal of
non-parametric regression is to find a function $f_D : X \to \mathbb{R}$ such that important characteristics
of the conditional distribution $P(Y|x)$, $x \in X$, can be recovered. For instance, an
$f_D$ approximating the conditional mean $\mathbb{E}(Y|x)$, $x \in X$, is sought in non-parametric
least squares regression. This classical non-parametric regression problem has been extensively
studied in the literature; a general reference is the book (Györfi et al., 2002),
which presents plenty of results concerning non-parametric least squares regression.

In the literature, there are many learning methods that solve the non-parametric regression
problem; some of them are described e.g. in (Györfi et al., 2002; Koenker, 2005;
Simonoff, 1996). In this paper, we utilize kernel-based regularized empirical risk
minimizers, also known as support vector machines (SVMs), which solve the regularized
problem

$$f_{D,\lambda} \in \arg\min_{f \in H} \; \lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f) . \qquad (1)$$


Here, $\lambda > 0$ is a fixed real number and $H$ is a reproducing kernel Hilbert space (RKHS)
over $X$ with reproducing kernel $k : X \times X \to \mathbb{R}$, see e.g. (Aronszajn, 1950; Berlinet and
Thomas-Agnan, 2004; Steinwart and Christmann, 2008a). Besides, $\mathcal{R}_{L,D}(f)$ denotes the
empirical risk of a function $f : X \to \mathbb{R}$, that is

$$\mathcal{R}_{L,D}(f) = \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, f(x_i)) ,$$

where $D$ is the empirical measure associated to the data $D$ defined by $D := \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, y_i)}$



Section 6.4) for more details. Besides, it is worth mentioning that the ability to choose 
the RKHS $H$ as well as the loss function $L$ in (1) provides the possibility to flexibly apply
SVMs to various learning problems. Namely, the learning target is modeled by the loss
function, e.g. the least squares loss is used to estimate the conditional mean. Moreover,
since RKHSs are defined on arbitrary $X$, data types that are not $\mathbb{R}^d$-valued can be handled,
too. Furthermore, SVMs are enjoying great popularity, since they can be implemented and 
applied in a relatively simple way and only have a few free parameters that can usually be 
determined by cross validation. 
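
For the least squares loss, the regularized problem (1) has a closed-form solution, which makes the simplicity claimed above concrete. The following is a minimal sketch, assuming a Gaussian kernel and the standard representer-theorem form $f = \sum_i \alpha_i k(\cdot, x_i)$; the function names `gaussian_kernel`, `fit_ls_svm`, `predict`, and `objective` are illustrative choices and not taken from any particular SVM package.

```python
import numpy as np


def gaussian_kernel(X1, X2, gamma):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / gamma**2)


def fit_ls_svm(X, y, lam, gamma):
    """Coefficients alpha of f = sum_i alpha_i k(., x_i) minimizing
    lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2."""
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    # First-order condition of the objective: (K + n * lam * I) alpha = y.
    return np.linalg.solve(K + n * lam * np.eye(n), y)


def predict(X_train, alpha, X_new, gamma):
    """Evaluate f at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha


def objective(X, y, alpha, lam, gamma):
    """Regularized empirical risk of f = sum_i alpha_i k(., x_i)."""
    K = gaussian_kernel(X, X, gamma)
    return lam * alpha @ K @ alpha + np.mean((y - K @ alpha) ** 2)
```

The two free parameters `lam` and `gamma` are the ones that would typically be chosen by cross validation. The linear solve costs on the order of $n^3$ operations, which is the super-linear cost that motivates the localized approach.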


An essential theoretical task, which has attracted many considerations, is the inves- 


squares loss and generic kernels can be found in (Cucker and Smale, 2002; De Vito et al.,
2005; Smale and ...; Caponnetto and De Vito, 2007; Mendelson and Neeman, 2010;
Steinwart et al., 2009) and the references therein, while similar rates for SVMs using the


not want to take a closer look at these results; instead we refer to (Eberts and Steinwart,
2013), where a detailed discussion can be found. More important for our purposes is the
fact that Eberts and Steinwart (2011, 2013) establish (essentially) asymptotically optimal
learning rates for least squares SVMs (LS-SVMs) using Gaussian RBF kernels. More precisely,
for a domain $X \subset B_{\ell_2^d}$, $Y := [-M, M]$ with $M > 0$, a distribution $P$ on $X \times Y$
such that $P_X$ has a bounded Lebesgue density on $X$, and for $f^*_{L,P}$ contained in the Sobolev
space $W_2^\alpha(P_X)$, $\alpha \in \mathbb{N}$, or in the Besov space $B_{2,\infty}^\alpha(P_X)$, $\alpha \ge 1$, respectively, the LS-SVM
using Gaussian kernels learns for all $\xi > 0$ with rate $n^{-\frac{2\alpha}{2\alpha+d}+\xi}$ with a high probability.

Although these rates are essentially asymptotically optimal, they depend on the order of 
smoothness of the regression function on the entire input space $X$. That is, if the regression
function $f^*_{L,P}$ is smoother on some area of $X$ than on another area, the learning rate is
determined by the part of $X$ where the regression function $f^*_{L,P}$ is least smooth (cf. Figure 1).
In contrast to this, it would be desirable to achieve a learning rate on every region of $X$ that
corresponds to the order of smoothness of $f^*_{L,P}$ on this region. Therefore, one of our goals in
this paper is to modify the standard SVM approach such that we achieve local learning rates
that are asymptotically optimal. Our technique to achieve such local learning rates is a special
local SVM approach. Local SVMs have been extensively investigated in the literature to speed
up the training time, see for instance the early works (Bottou and Vapnik, 1992; Vapnik and
Bottou, 1993). The basic idea of many local approaches is to a) split the training data and just
consider a few examples near a testing sample, b) train on this small subset of the training data,
and c) use the solution for a prediction w.r.t. the test sample. Here, many up-to-date
investigations use SVMs to train on the local data set, yet there are different ways to split the
whole training data set into smaller, local sets. For example, Chang et al. (2010); Wu et al. (1999);
Bennett and Blue (1998) use decision trees, while in (Hable, 2013; Segata and Blanzieri, 2010, 2008;
Blanzieri and Melgani, 2008; Blanzieri and Bryl, 2007a,b; Zhang et al., 2006) local subsets are
built considering $k$ nearest neighbors. The latter approaches further vary: for example, Zhang et al.
(2006); Blanzieri and Bryl (2007a); Hable (2013) consider different metrics w.r.t. the input space,
whereas Segata and Blanzieri (2008); Blanzieri and Melgani (2008); Blanzieri and Bryl (2007b)
consider metrics w.r.t. the feature space. Nonetheless, the basic idea of all these articles is
that an SVM problem based on $k$ training samples is solved for each test sample. Another
approach using $k$ nearest neighbors is investigated in (Segata and Blanzieri, 2010). Here,
$k$-neighborhoods consisting of training samples and collectively covering the training data
set are constructed and an SVM is calculated on each neighborhood. The prediction for
a test sample is then made according to the nearest training sample that is a center of a
$k$-neighborhood. As for the other nearest neighbor approaches, however, the results are
mainly experimental. An exception to this rule is (Hable, 2013), where universal consistency
for localized versions of SVMs, or more precisely, a large class of regularized kernel
methods, is proven. Another article presenting theoretical results for localized versions of
learning methods is (Zakai and Ritov, 2009). Here, the authors show that a consistent learning
method behaves locally, i.e. the prediction is essentially influenced by nearby samples.
However, this result is based on a localization technique considering only training samples
contained in a neighborhood with a fixed radius and center $x$ when an estimate in $x$ is
sought. Probably closest to our approach is the one examined in (Cheng et al., 2010) and
(Cheng et al., 2007), where the training data is split into clusters and then an SVM is
trained on each cluster. However, the presented results are only of experimental character.

Figure 1: The input space $X$ is partitioned by $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ such that the regression
function $f^*_{L,P}$ is less smooth on $X^{(2)}$ compared to $X^{(1)}$ and $X^{(3)}$. However, it is desirable to
achieve locally optimal learning rates.


In this article, we partition the input space X according to a cover of X with radius 
and build an SVM model for each partition cell. The following section is dedicated to the 
detailed description of this method. Section 3 then presents some theoretical results that
enable the analysis of this new method. For example, we examine extensions and direct
sums of RKHSs. At the end of Section 3, we finally present a first oracle inequality for
the localized SVM. In Section 4, we focus on RKHSs using Gaussian RBF kernels and, in
conjunction with that, we study some entropy estimates. After that, Section 5 concentrates
on the least squares loss and introduces an oracle inequality and learning rates for our
localized SVM method using Gaussian kernels. Moreover, a data-dependent parameter
selection method is studied that induces the same rates. Section 6 then presents some
experimental results w.r.t. the localized SVM technique. All proofs can be found in Section
7, and the appendix contains various tables displaying detailed results of our experiments.


2. Description of the Localized SVM Approach 

In this section, we introduce some general notations and assumptions. Based on the latter 
we modify the standard SVM approach. Let us start with the probability measure P on 
$X \times Y$, where $X \subset \mathbb{R}^d$ is non-empty and $Y := [-M, M]$ for some $M > 0$. Depending on
the learning target one chooses a loss function $L$, i.e. a function $L : X \times Y \times \mathbb{R} \to [0, \infty)$
that is measurable. Then, for a measurable function $f : X \to \mathbb{R}$, the $L$-risk is defined by

$$\mathcal{R}_{L,P}(f) = \int_{X \times Y} L(x, y, f(x)) \, dP(x, y)$$

and the optimal $L$-risk, called the Bayes risk with respect to $P$ and $L$, is given by

$$\mathcal{R}^*_{L,P} := \inf \{ \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ measurable} \} .$$

A measurable function $f^*_{L,P} : X \to \mathbb{R}$ with $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$ is called a Bayes decision
function. For the commonly used losses such as the least squares loss treated in Section 5,
the Bayes decision function $f^*_{L,P}$ is $P_X$-almost surely $[-M, M]$-valued, since $Y = [-M, M]$.
In this case, it seems natural to consider estimators with values in $[-M, M]$ on $X$. To this
end, we now introduce the concept of clipping the decision function. Let $\wideparen{t}$ be the clipped
value of some $t \in \mathbb{R}$ at $\pm M$ defined by

$$\wideparen{t} := \begin{cases} -M & \text{if } t < -M , \\ t & \text{if } t \in [-M, M] , \\ M & \text{if } t > M . \end{cases}$$

Then a loss is called clippable at $M > 0$ if, for all $(x, y, t) \in X \times Y \times \mathbb{R}$, we have

$$L(x, y, \wideparen{t}) \le L(x, y, t) .$$

Obviously, the latter implies

$$\mathcal{R}_{L,P}(\wideparen{f}) \le \mathcal{R}_{L,P}(f)$$

for all $f : X \to \mathbb{R}$. In other words, restricting the decision function to the interval $[-M, M]$
containing our labels cannot worsen the risk; in fact, clipping this function typically reduces
the risk. Hence, we consider the clipped version $\wideparen{f}_D$ of the decision function as well as the
risk $\mathcal{R}_{L,P}(\wideparen{f}_D)$ instead of the risk $\mathcal{R}_{L,P}(f_D)$ of the unclipped decision function. Note that this
clipping idea does not change the learning method since it is performed after the training
phase.
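
Since clipping is a pure post-processing step, it can be illustrated in a few lines. The sketch below assumes the least squares loss and labels in $[-M, M]$; the helper names `clip_at` and `least_squares_loss` are hypothetical.

```python
import numpy as np


def clip_at(t, M):
    """Clipped value of t at +/-M: -M below -M, M above M, t otherwise."""
    return np.clip(t, -M, M)


def least_squares_loss(y, t):
    """Least squares loss L(y, t) = (y - t)^2."""
    return (y - t) ** 2


# For labels y in [-M, M], clipping a prediction never increases the loss.
rng = np.random.default_rng(1)
M = 1.0
y = rng.uniform(-M, M, size=1000)
t = rng.normal(0.0, 3.0, size=1000)
never_worse = np.all(
    least_squares_loss(y, clip_at(t, M)) <= least_squares_loss(y, t)
)
```

Here `never_worse` evaluates to `True`, which is the pointwise inequality $L(x, y, \wideparen{t}) \le L(x, y, t)$ defining clippability.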

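To modify the standard SVM approach (1), we assume that $(A_j)_{j=1,\ldots,m}$ is a partition
of $X$ such that $A_j \neq \emptyset$ for every $j \in \{1, \ldots, m\}$. Obviously, this implies $A_{j_1} \cap A_{j_2} = \emptyset$ for
all $j_1, j_2 \in \{1, \ldots, m\}$ with $j_1 \neq j_2$ and

$$X = \bigcup_{j=1}^m A_j .$$

Now, the basic idea of the approach developed in this paper is to consider for each set of the
partition $(A_j)_{j=1,\ldots,m}$ an individual SVM. To describe this approach in a mathematically
rigorous way, we have to introduce some more definitions and notations. Let us begin with
the index set

$$I_j := \{ i \in \{1, \ldots, n\} : x_i \in A_j \} , \qquad j = 1, \ldots, m ,$$

indicating the samples of $D$ contained in $A_j$, as well as the corresponding data set

$$D_j := \{ (x_i, y_i) \in D : i \in I_j \} , \qquad j = 1, \ldots, m .$$

Moreover, for every $j \in \{1, \ldots, m\}$, we define a (local) loss function $L_j : X \times Y \times \mathbb{R} \to [0, \infty)$
by

$$L_j(x, y, t) := 1_{A_j}(x) L(x, y, t) , \qquad (2)$$

where $L : X \times Y \times \mathbb{R} \to [0, \infty)$ is the loss that corresponds to our learning problem at hand.
We further assume that $H_j$ is an RKHS over $A_j$ with kernel $k_j : A_j \times A_j \to \mathbb{R}$. Note that
every function $f \in H_j$ is only defined on $A_j$ even though a function $f_D : X \to \mathbb{R}$ is finally
sought. To this end, for $f \in H_j$, we define a function $\hat{f} : X \to \mathbb{R}$ by

$$\hat{f}(x) := \begin{cases} f(x) , & x \in A_j , \\ 0 , & x \notin A_j . \end{cases}$$

Then the space $\hat{H}_j := \{ \hat{f} : f \in H_j \}$ equipped with the norm

$$\|\hat{f}\|_{\hat{H}_j} := \|f\|_{H_j} , \qquad \hat{f} \in \hat{H}_j ,$$

is an RKHS on $X$ (cf. Lemma 2). That is, $\hat{H}_j$ is an isometrically isomorphic extension of
the RKHS $H_j$ on $A_j$ to an RKHS on $X$. After all, we are now able to formulate a modified
SVM approach. To this end, for every $j \in \{1, \ldots, m\}$, consider the local SVM optimization
problem

$$f_{D_j,\lambda_j} = \arg\min_{\hat{f} \in \hat{H}_j} \; \lambda_j \|\hat{f}\|_{\hat{H}_j}^2 + \frac{1}{n} \sum_{i=1}^n L_j(x_i, y_i, \hat{f}(x_i)) , \qquad (3)$$

where $\lambda_j > 0$ for every $j \in \{1, \ldots, m\}$. Based on these empirical SVM solutions, we then
define the decision function $f_{D,\lambda} : X \to \mathbb{R}$ by

$$f_{D,\lambda}(x) := \sum_{j=1}^m f_{D_j,\lambda_j}(x) = \sum_{j=1}^m 1_{A_j}(x) f_{D_j,\lambda_j}(x) , \qquad (4)$$

where $\lambda := (\lambda_1, \ldots, \lambda_m)$. Here, clipping $f_{D,\lambda}$ at $M$ yields

$$\wideparen{f}_{D,\lambda}(x) = \sum_{j=1}^m 1_{A_j}(x) \wideparen{f}_{D_j,\lambda_j}(x)$$

for every $x \in X$. Note that the empirical SVM solutions $f_{D_j,\lambda_j}$ in (3) exist and are unique
by (Steinwart and Christmann, 2008a, Theorem 5.5) and that, for arbitrary $j \in \{1, \ldots, m\}$,
$f_{D_j,\lambda_j} = 0$ if $x_i \notin A_j$ for all $i \in \{1, \ldots, n\}$. In addition, the SVM optimization problem (3)
equals the SVM optimization problem (1) using $H_j$, $D_j$, and the regularization parameter
$\tilde{\lambda}_j := \frac{n}{|I_j|} \lambda_j$, since, for $\hat{f} \in \hat{H}_j$ and $f := \hat{f}|_{A_j}$, we have

$$\lambda_j \|\hat{f}\|_{\hat{H}_j}^2 + \frac{1}{n} \sum_{i=1}^n L_j(x_i, y_i, \hat{f}(x_i))
= \lambda_j \|f\|_{H_j}^2 + \frac{1}{n} \sum_{i \in I_j} L(x_i, y_i, f(x_i))
= \frac{|I_j|}{n} \Bigl( \tilde{\lambda}_j \|f\|_{H_j}^2 + \mathcal{R}_{L,D_j}(f) \Bigr) .$$

That is, $f_{D_j,\lambda_j}$ as in (3) and $h_{D_j,\tilde{\lambda}_j} := \arg\min_{f \in H_j} \tilde{\lambda}_j \|f\|_{H_j}^2 + \mathcal{R}_{L,D_j}(f)$ satisfy

$$\hat{h}_{D_j,\tilde{\lambda}_j} = f_{D_j,\lambda_j} .$$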
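For the sake of completeness, we briefly examine the Bayes risks w.r.t. $P$ and $L_j$. To
this end, let $X \subset \mathbb{R}^d$, $Y \subset \mathbb{R}$, $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function and $P$ be a
distribution on $X \times Y$ such that a Bayes decision function $f^*_{L,P} : X \to \mathbb{R}$ exists. Then, for
all $j \in \{1, \ldots, m\}$ and losses $L_j$ defined by (2), it is easy to show

$$\mathcal{R}_{L_j,P}(f^*_{L,P}) = \mathcal{R}^*_{L_j,P} ,$$

whenever $f^*_{L,P}$ exists. In other words, a Bayes decision function $f^*_{L,P}$ w.r.t. $P$ and $L$ additionally
is a Bayes decision function w.r.t. $P$ and $L_j$. Moreover, for function spaces $\mathcal{F}_1, \ldots, \mathcal{F}_m$
over $X$, we have

$$\sum_{j=1}^m \min_{f_j \in \mathcal{F}_j} \mathcal{R}_{L_j,D}(f_j) = \min_{f_1 \in \mathcal{F}_1, \ldots, f_m \in \mathcal{F}_m} \sum_{j=1}^m \mathcal{R}_{L_j,D}(f_j) \qquad (5)$$

by the construction of the loss $L_j$.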

Let us now present an advantageous characteristic of our modified SVM, namely the
required computing time. Solving a usual SVM problem has a computational cost of $O(n^q)$,
where $q \in [2, 3]$ and $n$ is the sample size. For the new approach we consider $m$ working
sets of size $n_1, \ldots, n_m$, where $n_i \approx n_j$ for all $i, j \in \{1, \ldots, m\}$, i.e. $n_i \approx n/m$. Then for each
working set a usual SVM problem has to be solved such that, altogether, the modified
SVM induces a computational cost of $O(m (n/m)^q)$. That is, for some $\beta > 0$ and $m \sim n^\beta$,
our approach is computationally cheaper than a traditional SVM. Note that our strategy
using a partition of the input space is a typical way to speed up algorithms and handle large
data sets. Other techniques that possess similar properties are e.g. applied in the articles
cited in the introduction. Besides, we refer to (Tsang et al., 2007) and (Tsang et al., 2005)
using enclosing ball problems to solve an SVM, to (Graf et al., 2005) presenting a model
of multiple filtering SVMs, and to (Collobert et al., 2001) investigating a mixture of SVMs
based on several subsets of the training set.
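
The cost comparison above amounts to simple arithmetic: with equally sized working sets, the ratio of the global cost $n^q$ to the local cost $m (n/m)^q$ is $m^{q-1}$. The short sketch below, with hypothetical helper names and constants in the $O(\cdot)$ terms ignored, makes this explicit.

```python
def global_cost(n, q):
    """Cost model n^q of one SVM trained on all n samples."""
    return float(n) ** q


def local_cost(n, m, q):
    """Cost model m * (n/m)^q of m SVMs trained on n/m samples each."""
    return m * (n / m) ** q


def speedup(n, m, q):
    """Ratio of global to local cost; algebraically equal to m^(q-1)."""
    return global_cost(n, q) / local_cost(n, m, q)
```

For example, under this model `speedup(100_000, 100, 2.0)` is 100 and `speedup(100_000, 100, 3.0)` is 10,000. With $m = n^\beta$ the local cost becomes $n^{q - \beta(q-1)}$, so any $\beta > 0$ lowers the exponent.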

To describe the above SVM approach, $(A_j)_{j=1,\ldots,m}$ only has to be some partition of $X$.
However, for the theoretical investigations concerning learning rates of our new approach,
we have to further specify the partition. To this end, we denote by $B_{\ell_2^d}$ the closed unit ball
in the $d$-dimensional Euclidean space $\ell_2^d$ and we define balls $B_1, \ldots, B_m$ with radius $r > 0$
and mutually distinct centers $z_1, \ldots, z_m \in X$ by

$$B_j := B_r(z_j) := \{ x \in X : \|x - z_j\|_2 \le r \} , \qquad j \in \{1, \ldots, m\} , \qquad (6)$$

where $\|\cdot\|_2$ is the Euclidean norm in $\mathbb{R}^d$. Moreover, choose $r$ and $z_1, \ldots, z_m$ such that

$$\bigcup_{j=1}^m B_j = X ,$$

i.e. such that the balls $B_1, \ldots, B_m$ cover $X$ (cf. Figure 2). The following well-known lemma
relates the radius of such a cover with the number of centers.

Lemma 1 Let $X \subset \mathbb{R}^d$ be a bounded subset, i.e. $X \subset c B_{\ell_2^d}$ for some constant $c > 0$. Then
there exist balls $(B_j)_{j=1,\ldots,m}$ with radius $r > 0$ covering $X$ such that

$$r \le 8 c m^{-\frac{1}{d}} .$$

For simplicity of notation, we assume in the following that $X \subset B_{\ell_2^d}$, i.e. according to
Lemma 1 there exists a cover $(B_j)_{j=1,\ldots,m}$ with

$$r \le 8 m^{-\frac{1}{d}} . \qquad (7)$$

Finally, we can specify the partition $(A_j)_{j=1,\ldots,m}$ of $X$ by the following assumption.

(A) Let $(A_j)_{j=1,\ldots,m}$ be a partition of $X \subset B_{\ell_2^d}$ such that $A_j \neq \emptyset$ for every $j \in \{1, \ldots, m\}$
and such that there exist mutually distinct $z_1, \ldots, z_m \in X$ with $A_j \subset B_r(z_j) =: B_j$,
where $(B_j)_{j=1,\ldots,m}$ is a cover of $X$ satisfying (7).

In the remaining sections we will frequently refer to Assumption (A). However, the results
hold as well if we merely assume $z_1, \ldots, z_m \in B_{\ell_2^d}$ instead of $z_1, \ldots, z_m \in X \subset B_{\ell_2^d}$ in (A).
The following example illustrates that (A) is indeed a natural assumption.
Figure 2: Cover $(B_j)_{j=1,\ldots,m}$ of $X$, where $B_1, \ldots, B_m$ are balls with radius $r$ and centers
$z_j$ ($j = 1, \ldots, m$).

Figure 3: Voronoi partition $(A_j)_{j=1,\ldots,m}$ of $X$ defined by (8), where $A_j \subset B_j$ for every
$j \in \{1, \ldots, m\}$.


Example 1 For some $r > 0$, let us consider an $r$-net $z_1, \ldots, z_m$ of $X$, where $z_1, \ldots, z_m$
are mutually distinct. Based on these $z_1, \ldots, z_m$, a Voronoi partition $(A_j)_{j=1,\ldots,m}$ of $X$ is
defined by

$$A_j := \Bigl\{ x \in X : j = \min \Bigl( \arg\min_{k \in \{1, \ldots, m\}} \|x - z_k\|_2 \Bigr) \Bigr\} , \qquad (8)$$

cf. Figure 3. That is, $A_j$ contains all $x \in X$ such that the center $z_j$ is the nearest center to
$x$, and if there exist $j_1$ and $j_2$ with $j_1 < j_2$ and

$$\|x - z_{j_1}\|_2 = \|x - z_{j_2}\|_2 \le \|x - z_k\|_2$$

for all $k \in \{1, \ldots, m\} \setminus \{j_1, j_2\}$, then $x \in A_{j_1}$ since $j_1 < j_2$. In other words, ties are resolved
in favor of the smallest index of the involved centers. Moreover, it is obvious that $A_j \neq \emptyset$,
$A_j \subset B_r(z_j)$ for all $j \in \{1, \ldots, m\}$, $A_{j_1} \cap A_{j_2} = \emptyset$ for all $j_1, j_2 \in \{1, \ldots, m\}$ with $j_1 \neq j_2$,
and $X = \bigcup_{j=1}^m A_j$. In other words, a Voronoi partition based on an $r$-net $z_1, \ldots, z_m$ of $X$
satisfies condition (A), if $r$ and $m$ fulfill (7).

Following Example 1, we call the learning method producing $f_{D,\lambda}$ given by (4) a Voronoi
partition support vector machine, in short VP-SVM. Nevertheless, we just take a partition
$(A_j)_{j=1,\ldots,m}$ satisfying (A) as basis here instead of requiring $(A_j)_{j=1,\ldots,m}$ to be a Voronoi
partition.
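
To make the construction tangible, the following is a minimal sketch of a VP-SVM for the least squares loss with local Gaussian kernels. It is an illustration under stated assumptions rather than the implementation used in the experiments: the centers are passed in by the caller (no $r$-net is computed), each local problem is solved in closed form via the representer theorem, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(X1, X2, gamma):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / gamma**2)


def voronoi_cells(X, centers):
    """Index of the nearest center for every row of X.

    np.argmin returns the first minimizer, so ties are resolved in favor
    of the smallest center index.
    """
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def fit_vp_svm(X, y, centers, lams, gammas):
    """One local least squares SVM per Voronoi cell.

    The empirical risk is normalized by the total sample size n, so the
    local linear system is (K_j + n * lam_j * I) alpha_j = y_j.
    """
    n = len(y)
    cells = voronoi_cells(X, centers)
    models = []
    for j in range(len(centers)):
        idx = np.flatnonzero(cells == j)
        if idx.size == 0:
            models.append(None)  # empty cell: the local solution is zero
            continue
        K = gaussian_kernel(X[idx], X[idx], gammas[j])
        alpha = np.linalg.solve(K + n * lams[j] * np.eye(idx.size), y[idx])
        models.append((X[idx], alpha))
    return models


def predict_vp_svm(models, centers, gammas, X_new, M=None):
    """Evaluate the local solution of the cell containing each new point;
    optionally clip the result at +/-M."""
    cells = voronoi_cells(X_new, centers)
    out = np.zeros(len(X_new))
    for j, model in enumerate(models):
        mask = cells == j
        if model is None or not mask.any():
            continue
        X_j, alpha = model
        out[mask] = gaussian_kernel(X_new[mask], X_j, gammas[j]) @ alpha
    return out if M is None else np.clip(out, -M, M)
```

Each cell has its own regularization parameter and kernel width, and a prediction only touches the samples of one cell, which is where the computational savings come from.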

Recall that our goal is to derive not only global but also local learning rates for this VP-SVM
approach. To this end, we additionally consider an arbitrary measurable set $T \subset X$
such that $P_X(T) > 0$. Then we examine the learning rate of the VP-SVM on this subset $T$
of $X$. To formalize this, it is necessary to introduce some basic notations related to $T$. Let
us define the index set $J_T$ by

$$J_T := \{ j \in \{1, \ldots, m\} : A_j \cap T \neq \emptyset \} \qquad (9)$$

specifying every set $A_j$ that has at least one common point with $T$. Note that, for every
non-empty set $T \subset X$, the index set $J_T$ is non-empty, too, i.e. $|J_T| \ge 1$. Besides, deriving
local rates on $T$ requires us to investigate the excess risk of the VP-SVM with respect to
the distribution $P$ and the loss $L_T : X \times Y \times \mathbb{R} \to [0, \infty)$ defined by

$$L_T(x, y, t) := 1_T(x) L(x, y, t) . \qquad (10)$$

However, to manage the analysis we additionally need the loss $L_{J_T} : X \times Y \times \mathbb{R} \to [0, \infty)$ given
by

$$L_{J_T}(x, y, t) := 1_{\bigcup_{j \in J_T} A_j}(x) L(x, y, t) , \qquad (11)$$

which may only be nonzero if $x$ is contained in some set $A_j$ with $j \in J_T$. Note that the risks
$\mathcal{R}_{L_T,P}(f)$ and $\mathcal{R}_{L_{J_T},P}(f)$ quantify the quality of some function $f$ just on $T$ and

$$A_T := \bigcup_{j \in J_T} A_j \supset T ,$$

respectively. Hence, examining the excess risks

$$\mathcal{R}_{L_T,P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L_T,P} \le \mathcal{R}_{L_{J_T},P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L_{J_T},P}$$

leads to learning rates on $A_T$ and implicitly on $T$. To recapitulate, let us declare a second
set of assumptions.

(T) For $T \subset X$, we define an index set $J_T$ by (9), loss functions $L_T, L_{J_T} : X \times Y \times \mathbb{R} \to
[0, \infty)$ by (10) and (11), and the set $A_T := \bigcup_{j \in J_T} A_j$.

3. An Oracle Inequality for VP-SVMs 

In this section, we first focus on RKHSs and direct sums of RKHSs. Then we present a 
lemma that relates the risk of a function w.r.t. the general loss L to the risks w.r.t. the 
losses Lj. Finally, we establish a first oracle inequality for VP-SVMs. 

Let us begin with some basic notations. For $q \in [1, \infty]$ and a measure $\nu$, we denote by
$L_q(\nu)$ the Lebesgue spaces of order $q$ w.r.t. $\nu$, and for the Lebesgue measure $\mu$ on $X \subset \mathbb{R}^d$
we write $L_q(X) := L_q(\mu)$. In addition, for a measurable space $\Omega$, the set of all real-valued
measurable functions on $\Omega$ is given by $\mathcal{L}_0(\Omega) := \{ f : \Omega \to \mathbb{R} \mid f \text{ measurable} \}$. Moreover,
for a measure $\nu$ on $X$ and measurable $\tilde{X} \subset X$, we define the trace measure $\nu|_{\tilde{X}}$ of $\nu$ in $\tilde{X}$
by $\nu|_{\tilde{X}}(A) = \nu(A \cap \tilde{X})$ for every $A \subset X$.

Our first goal is to show that $f_{D,\lambda}$ in (4) is actually an ordinary SVM solution. To
this end, we consider an RKHS on some $A \subset X$ and extend it to an RKHS on $X$ by the
following lemma, where we omit the obvious proof.

Lemma 2 Let $A \subset X$ and $H_A$ be an RKHS on $A$ with corresponding kernel $k_A$. Denote
by $\hat{f}$ the extension of $f \in H_A$ to $X$ defined by

$$\hat{f}(x) := \begin{cases} f(x) , & \text{for } x \in A , \\ 0 , & \text{for } x \in X \setminus A . \end{cases}$$

Then the space $\hat{H}_A := \{ \hat{f} : f \in H_A \}$ equipped with the norm

$$\|\hat{f}\|_{\hat{H}_A} := \|f\|_{H_A}$$

is an RKHS on $X$ and its reproducing kernel is given by

$$\hat{k}_A(x, x') := \begin{cases} k_A(x, x') , & \text{if } x, x' \in A , \\ 0 , & \text{else.} \end{cases} \qquad (12)$$

Figure 4: The input space $X$ with the corresponding partition $(A_j)_{j=1,\ldots,m}$ and the subset
$T$, where the local learning rate should be examined.


Based on this lemma, we are now able to construct an RKHS by a direct sum of RKHSs
$\hat{H}_A$ and $\hat{H}_B$ with $A, B \subset X$ and $A \cap B = \emptyset$. Here, we skip the proof once more, since the
assertion follows immediately using, for example, orthonormal bases of $\hat{H}_A$ and $\hat{H}_B$.

Lemma 3 For $A, B \subset X$ such that $A \cap B = \emptyset$ and $A \cup B \subset X$, let $H_A$ and $H_B$ be RKHSs
of $k_A$ and $k_B$ over $A$ and $B$, respectively. Furthermore, let $\hat{H}_A$ and $\hat{H}_B$ be the RKHSs of all
functions of $H_A$ and $H_B$ extended to $X$ in the sense of Lemma 2 and let $\hat{k}_A$ and $\hat{k}_B$ given
by (12) be the associated reproducing kernels. Then $\hat{H}_A \cap \hat{H}_B = \{0\}$ and hence the direct
sum

$$H := \hat{H}_A \oplus \hat{H}_B \qquad (13)$$

exists. For $\lambda_A, \lambda_B > 0$ and $f \in H$, let $\hat{f}_A \in \hat{H}_A$ and $\hat{f}_B \in \hat{H}_B$ be the unique functions such
that $f = \hat{f}_A + \hat{f}_B$. Then we define the norm $\|\cdot\|_H$ by

$$\|f\|_H^2 := \lambda_A \|\hat{f}_A\|_{\hat{H}_A}^2 + \lambda_B \|\hat{f}_B\|_{\hat{H}_B}^2 \qquad (14)$$

and $H$ equipped with the norm $\|\cdot\|_H$ is again an RKHS for which

$$k(x, x') := \lambda_A^{-1} \hat{k}_A(x, x') + \lambda_B^{-1} \hat{k}_B(x, x') , \qquad x, x' \in X ,$$

is the reproducing kernel.

To relate Lemmas 2 and 3 with (4), we have to introduce some more notations. For
pairwise disjoint sets $A_1, \ldots, A_m \subset X$, let $H_j$ be an RKHS on $A_j$ for every $j \in \{1, \ldots, m\}$.
Then, based on RKHSs $\hat{H}_1, \ldots, \hat{H}_m$ on $X$ defined by Lemma 2, the joined RKHSs can be
designed analogously to Lemma 3. That is, for an arbitrary index set $J \subset \{1, \ldots, m\}$ and
a vector $\lambda = (\lambda_j)_{j \in J} \in (0, \infty)^{|J|}$, the direct sum

$$H_J := \bigoplus_{j \in J} \hat{H}_j = \Bigl\{ f = \sum_{j \in J} f_j : f_j \in \hat{H}_j \text{ for all } j \in J \Bigr\}$$

is again an RKHS equipped with the norm

$$\|f\|_{H_J}^2 = \sum_{j \in J} \lambda_j \|f_j\|_{\hat{H}_j}^2 . \qquad (15)$$

If $J = \{1, \ldots, m\}$ we simply write

$$H := H_J . \qquad (16)$$

Note that $H$ contains inter alia $f_{D,\lambda}$ given by (4). Summarizing, we can define another
assumption set.
(H) For every $j \in \{1, \ldots, m\}$, let $H_j$ be a separable RKHS of a measurable kernel $k_j$ over
$A_j$, where $A_1, \ldots, A_m \subset X$ are pairwise disjoint and

$$\|k_j\|_{L_2(P_X|_{A_j})}^2 := \int_X k_j(x, x) \, dP_X|_{A_j}(x) < \infty .$$

Then we define RKHSs $\hat{H}_1, \ldots, \hat{H}_m$ by Lemma 2 and the joined RKHS $H$ by (16)
equipped with the norm (15) for fixed $\lambda_1, \ldots, \lambda_m > 0$.

Having designed a joined RKHS as above, a crucial property of the risks of its functions is
expressed by the following lemma.


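Lemma 4 Let $P$ be a distribution on $X \times Y$ and $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss
function. For $A, B \subset X$ such that $A \cup B = X$ and $A \cap B = \emptyset$, define loss functions $L_A, L_B :
X \times Y \times \mathbb{R} \to [0, \infty)$ by $L_A(x, y, t) = 1_A(x) L(x, y, t)$ and $L_B(x, y, t) = 1_B(x) L(x, y, t)$,
respectively. Furthermore, let $f_A : X \to \mathbb{R}$ as well as $f_B : X \to \mathbb{R}$ be measurable functions
and $f : X \to \mathbb{R}$ be defined by $f(x) = 1_A(x) f_A(x) + 1_B(x) f_B(x)$ for all $x \in X$. Then we
have

$$\mathcal{R}_{L,P}(f) = \mathcal{R}_{L_A,P}(f_A) + \mathcal{R}_{L_B,P}(f_B)$$

as well as

$$\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \bigl( \mathcal{R}_{L_A,P}(f_A) - \mathcal{R}^*_{L_A,P} \bigr) + \bigl( \mathcal{R}_{L_B,P}(f_B) - \mathcal{R}^*_{L_B,P} \bigr) .$$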
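
The first identity of Lemma 4 can be checked numerically when $P$ is replaced by an empirical measure. The sketch below assumes the least squares loss and a one-dimensional input; the function `decomposed_risks` and its arguments are hypothetical and serve only as an illustration.

```python
import numpy as np


def decomposed_risks(x, y, f_A, f_B, in_A):
    """Empirical least squares risks of f = 1_A f_A + 1_B f_B w.r.t. L,
    of f_A w.r.t. L_A, and of f_B w.r.t. L_B, where B is the complement
    of A and in_A(x) is a boolean membership test for A."""
    ind_A = in_A(x).astype(float)
    ind_B = 1.0 - ind_A
    f_vals = ind_A * f_A(x) + ind_B * f_B(x)
    risk_f = np.mean((y - f_vals) ** 2)
    risk_A = np.mean(ind_A * (y - f_A(x)) ** 2)
    risk_B = np.mean(ind_B * (y - f_B(x)) ** 2)
    return risk_f, risk_A, risk_B
```

For any sample and any pair of functions, the first returned value equals the sum of the other two up to floating point rounding, since on $A$ the function $f$ coincides with $f_A$ and on $B$ with $f_B$.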

Note that Lemma 4 can be transferred to finite, pairwise disjoint unions. To be more
precise, let us consider an arbitrary index set $J \subset \{1, \ldots, m\}$ and define the corresponding
loss function $L_J : X \times Y \times \mathbb{R} \to [0, \infty)$ by

$$L_J(x, y, t) := 1_{\bigcup_{j \in J} A_j}(x) L(x, y, t) .$$

Now, it is straightforward to show

$$\mathcal{R}_{L_J,P}(f) = \sum_{j \in J} \mathcal{R}_{L_j,P}(f)$$

for every function $f : X \to \mathbb{R}$. Based on this generalization and the whole index set
$J = \{1, \ldots, m\}$, let us briefly consider Lemma 4 for the empirical measure $D$ and for
$f_{D,\lambda} = \sum_{j=1}^m 1_{A_j} f_{D_j,\lambda_j}$, where $f_{D_j,\lambda_j}$, $j = 1, \ldots, m$, are defined by (3). Then, for an
arbitrary $f \in H$, it immediately follows

$$\begin{aligned}
\|f_{D,\lambda}\|_H^2 + \mathcal{R}_{L,D}(f_{D,\lambda})
&= \sum_{j=1}^m \Bigl( \lambda_j \|f_{D_j,\lambda_j}\|_{\hat{H}_j}^2 + \mathcal{R}_{L_j,D}(f_{D_j,\lambda_j}) \Bigr) \\
&\le \sum_{j=1}^m \Bigl( \lambda_j \|1_{A_j} f\|_{\hat{H}_j}^2 + \mathcal{R}_{L_j,D}(f) \Bigr) \\
&= \|f\|_H^2 + \mathcal{R}_{L,D}(f) . \qquad (17)
\end{aligned}$$

That is, $f_{D,\lambda}$ is the decision function of an SVM using $H$ and $L$ as well as the regularization
parameter $\tilde{\lambda} = 1$. In other words, the latter SVM equals the VP-SVM given by (3). This
will be a key insight used in our analysis.
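
This equivalence can be verified numerically for the least squares loss. The sketch below assumes Gaussian local kernels and a partition given by integer cell labels; it builds the joined kernel $k = \sum_j \lambda_j^{-1} \hat{k}_j$ suggested by Lemma 3, solves one global problem with regularization parameter 1, and compares the result with the local solutions. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(X1, X2, gamma):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / gamma^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / gamma**2)


def joined_kernel(X1, cells1, X2, cells2, lams, gammas):
    """k(x, x') = sum_j lam_j^{-1} 1_{A_j}(x) 1_{A_j}(x') k_j(x, x')."""
    K = np.zeros((len(X1), len(X2)))
    for j, (lam, gamma) in enumerate(zip(lams, gammas)):
        rows = np.flatnonzero(cells1 == j)
        cols = np.flatnonzero(cells2 == j)
        if rows.size and cols.size:
            K[np.ix_(rows, cols)] = gaussian_kernel(X1[rows], X2[cols], gamma) / lam
    return K


def global_coefficients(X, y, cells, lams, gammas):
    """Least squares SVM over the joined RKHS with regularization 1."""
    n = len(y)
    K = joined_kernel(X, cells, X, cells, lams, gammas)
    return np.linalg.solve(K + n * np.eye(n), y)


def local_coefficients(X, y, cells, lams, gammas):
    """Local least squares SVMs, one per cell, as in problem (3)."""
    n = len(y)
    alpha = np.zeros(n)
    for j, (lam, gamma) in enumerate(zip(lams, gammas)):
        idx = np.flatnonzero(cells == j)
        if idx.size:
            K_j = gaussian_kernel(X[idx], X[idx], gamma)
            alpha[idx] = np.linalg.solve(K_j + n * lam * np.eye(idx.size), y[idx])
    return alpha
```

The global coefficients on cell $j$ equal $\lambda_j$ times the local ones, and since the joined kernel carries the factor $\lambda_j^{-1}$, both approaches yield the same decision function.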

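To derive an oracle inequality, i.e. an appropriate upper bound for the excess risk
$\mathcal{R}_{L_J,P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L_J,P}$ for some index set $J \subset \{1, \ldots, m\}$, we have to introduce a few
more notations. Let $P$ be a distribution on $X \times Y$ such that a Bayes decision function
$f^*_{L,P} : X \to [-M, M]$ exists, for some constant $M > 0$ at which $L$ can be clipped. Moreover,
we denote by $L \circ f$ the function $(x, y) \mapsto L(x, y, f(x))$. If there exist constants $B > 0$,
$\vartheta \in [0, 1]$, and $V \ge B^{2-\vartheta}$ such that we have

$$L(x, y, t) \le B , \qquad (18)$$

$$\mathbb{E}_P (L \circ f - L \circ f^*_{L,P})^2 \le V \cdot \bigl( \mathbb{E}_P (L \circ f - L \circ f^*_{L,P}) \bigr)^\vartheta , \qquad (19)$$

for all $(x, y) \in X \times Y$, $t \in [-M, M]$, and $f : X \to [-M, M]$, we say that the supremum
bound (18) and the variance bound (19), respectively, is fulfilled. Actually, (18) immediately
yields

$$L_J(x, y, t) = 1_{\bigcup_{j \in J} A_j}(x) L(x, y, t) \le L(x, y, t) \le B$$

for all $(x, y) \in X \times Y$ and $t \in [-M, M]$, i.e. the supremum bound is also satisfied for $L_J$.
Moreover, if (19) holds for all $f : X \to [-M, M]$, the variance bound using the loss $L_J$ is
satisfied, too. Indeed, by the use of $\tilde{f}(x) := 1_{\bigcup_{j \in J} A_j}(x) f(x) + 1_{X \setminus \bigcup_{j \in J} A_j}(x) f^*_{L,P}(x)$ for
all $x \in X$, we have

$$\begin{aligned}
\mathbb{E}_P (L_J \circ f - L_J \circ f^*_{L,P})^2
&= \mathbb{E}_P (L_J \circ \tilde{f} - L_J \circ f^*_{L,P})^2 \\
&= \mathbb{E}_P (L \circ \tilde{f} - L \circ f^*_{L,P})^2 \\
&\le V \cdot \bigl( \mathbb{E}_P (L \circ \tilde{f} - L \circ f^*_{L,P}) \bigr)^\vartheta \\
&\le V \cdot \bigl( \mathbb{E}_P (L_J \circ f - L_J \circ f^*_{L,P}) \bigr)^\vartheta
\end{aligned}$$

for all $f : X \to [-M, M]$. Let us quickly define a third assumption set.

(P) Let $P$ be a distribution on $X \times Y$ such that the variance bound (19) is satisfied for
constants $\vartheta \in [0, 1]$, $V \ge B^{2-\vartheta}$, and all functions $f : X \to [-M, M]$.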

Up to now, a classical tool that is used to derive learning rates is still missing,
namely entropy numbers, see (Carl and Stephani, 1990) or (Steinwart and Christmann,
2008a, Definition A.5.26). Recall that, for normed spaces $(E, \|\cdot\|_E)$ and $(F, \|\cdot\|_F)$ as
well as an integer $i \ge 1$, the $i$-th (dyadic) entropy number of a bounded, linear operator
$S : E \to F$ is defined by

$$\begin{aligned}
e_i(S : E \to F) &:= e_i(S B_E, \|\cdot\|_F) \\
&:= \inf \Bigl\{ \varepsilon > 0 : \exists\, s_1, \ldots, s_{2^{i-1}} \in S B_E \text{ such that } S B_E \subset \bigcup_{j=1}^{2^{i-1}} (s_j + \varepsilon B_F) \Bigr\} ,
\end{aligned}$$

where we use the convention $\inf \emptyset := \infty$, and $B_E$ as well as $B_F$ denote the closed unit balls
in $E$ and $F$, respectively. Finally, we present a first oracle inequality involving an upper
bound for the excess risk $\mathcal{R}_{L_J,P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L_J,P}$, where $J \subset \{1, \ldots, m\}$ is an arbitrary index
set.

Theorem 5 Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a locally Lipschitz continuous loss that can
be clipped at $M > 0$ and that satisfies the supremum bound (18) for some $B > 0$. Based
on a partition $(A_j)_{j=1,\ldots,m}$ of $X$, where $A_j \neq \emptyset$ for every $j \in \{1, \ldots, m\}$, we assume (H).
Furthermore, for an arbitrary index set $J \subset \{1, \ldots, m\}$, we suppose (P). Assume that,
for fixed $n \ge 1$, there exist constants $p \in (0, 1)$ and $a_1, \ldots, a_m > 0$ such that for all
$j \in \{1, \ldots, m\}$

$$e_i(\mathrm{id} : H_j \to L_2(P_X|_{A_j})) \le a_j \, i^{-\frac{1}{2p}} , \qquad i \ge 1 . \qquad (20)$$

Finally, fix an $f_0 \in H$ and a constant $B_0 \ge B$ such that $\|L_J \circ f_0\|_\infty \le B_0$. Then, for all
fixed $\tau > 0$, $\lambda = (\lambda_1, \ldots, \lambda_m) > 0$, and


a := max 



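the VP-SVM given by (3) using $\hat{H}_1, \ldots, \hat{H}_m$ and $L_J$ satisfies

$$\begin{aligned}
& \sum_{j=1}^m \lambda_j \|f_{D_j,\lambda_j}\|_{\hat{H}_j}^2 + \mathcal{R}_{L_J,P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L_J,P} \\
& \quad \le 9 \Bigl( \sum_{j=1}^m \lambda_j \|1_{A_j} f_0\|_{\hat{H}_j}^2 + \mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P} \Bigr)
+ C \bigl( a^{2p} n^{-1} \bigr)^{\frac{1}{2-p-\vartheta+\vartheta p}}
+ 3 \Bigl( \frac{72 V \tau}{n} \Bigr)^{\frac{1}{2-\vartheta}}
+ \frac{15 B_0 \tau}{n}
\end{aligned}$$

with probability $P^n$ not less than $1 - 3e^{-\tau}$, where $C > 0$ is a constant only depending on $p$,
$M$, $V$, $\vartheta$, and $B$.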


The above theorem deals with the case of a partition with quite a few sets $A_j$, $j \in
\{1, \ldots, m\}$. However, if we consider a partition consisting of just one set $A_1$, i.e. $A_1 = X$,
Theorem 5 is supposed to provide an oracle inequality that is comparable to the already
known ones. To verify this, let us briefly consider the case $m = 1$ and hence $A_1 := X$,
$\lambda_1 := \lambda$ as well as RKHSs $\hat{H}_1 = H_1$ over $X$ with $\|\cdot\|_H^2 = \lambda \|\cdot\|_{H_1}^2$. Note that in this
case we have $f_{D,\lambda} = f_{D_1,\lambda_1}$. If (20) holds for $H_1$, Theorem 5 yields that an SVM using $H_1$
and $L_J = L$ satisfies

$$\begin{aligned}
& \lambda \|f_{D,\lambda}\|_{H_1}^2 + \mathcal{R}_{L,P}(\wideparen{f}_{D,\lambda}) - \mathcal{R}^*_{L,P} \\
& \quad \le 9 \bigl( \lambda \|f_0\|_{H_1}^2 + \mathcal{R}_{L,P}(f_0) - \mathcal{R}^*_{L,P} \bigr)
+ C_p \Bigl( \frac{a_1^{2p}}{\lambda^p n} \Bigr)^{\frac{1}{2-p-\vartheta+\vartheta p}}
+ 3 \Bigl( \frac{72 V \tau}{n} \Bigr)^{\frac{1}{2-\vartheta}}
+ \frac{15 B_0 \tau}{n}
\end{aligned}$$

with probability $P^n$ not less than $1 - 3e^{-\tau}$ for fixed $\tau > 0$. Note that this oracle inequality
indeed matches the one stated in (Steinwart and Christmann, 2008a, Theorem 7.23)
apart from the constant $C_p$, which is, however, only depending on $p$, $\vartheta$, and $B$.

In the following section, we focus on RKHSs using Gaussian RBF kernels and examine
the associated entropy numbers to specify (20). Subsequently, in Section 5, we additionally
consider the least squares loss and adapt the oracle inequality of Theorem 5.

4. Entropy Estimates for Local Gaussian RKHSs 

In this section, we refine assumption (20). More precisely, in the subsequent theorem we
determine an upper bound for the entropy numbers of the operator $\mathrm{id} : H_\gamma(A) \to L_2(P_X|_A)$,
where $H_\gamma(A)$ is the RKHS over $A$ of the Gaussian RBF kernel $k_\gamma$ on $A \subset \mathbb{R}^d$ defined by

$$k_\gamma(x, x') := \exp \bigl( -\gamma^{-2} \|x - x'\|_2^2 \bigr) , \qquad x, x' \in A ,$$

for some width $\gamma > 0$.

Theorem 6 Let $X \subset \mathbb{R}^d$, $P_X$ be a distribution on $X$ and $A \subset X$ be such that $A \neq \emptyset$ and
such that there exists a Euclidean ball $B \subset \mathbb{R}^d$ with radius $r > 0$ containing $A$, i.e. $A \subset B$.
Moreover, for $0 < \gamma \le r$, let $H_\gamma(A)$ be the RKHS of the Gaussian RBF kernel $k_\gamma$ over $A$.
Then, for all $p \in (0, 1)$, there exists a constant $c_p > 0$ such that

ei(id : H^{A) L2(Px\a)) < Cp^/P^{A)r 7 i , i>l. 

Obviously, this theorem specifies assumption (20). Now, for the Gaussian case we elaborate
assumption (H) and introduce the following additional set of assumptions.

(G) Let $A_1, \ldots, A_m$ be pairwise disjoint subsets of $X$ with non-empty interior such that,
for some fixed $r > 0$ and every $j \in \{1, \ldots, m\}$, $\sup_{x, x' \in A_j} \|x - x'\|_2 \le 2r$ is satisfied.
Furthermore, for every $j \in \{1, \ldots, m\}$, let $H_j := H_{\gamma_j}(A_j)$ be the RKHS of the Gaussian
kernel $k_{\gamma_j}$ with width $\gamma_j \in (0, r]$ over $A_j$. Consequently, for $\lambda := (\lambda_1, \ldots, \lambda_m) \in
(0, \infty)^m$, we define the joined RKHS $H := \bigoplus_{j=1}^m \hat{H}_{\gamma_j}(A_j)$ by (16) equipped with the
norm (15).

Since we do not consider SVMs with a fixed kernel, we use a more detailed notation
than (3) and (4) in the following, specifying the kernel width $\gamma_j$ of the RKHS $H_{\gamma_j}(A_j)$ at
hand. For all $j \in \{1, \ldots, m\}$ and $\gamma := (\gamma_1, \ldots, \gamma_m)$, we thus write

$$f_{D_j,\lambda_j,\gamma_j} = \arg\min_{f \in \hat{H}_{\gamma_j}(A_j)} \; \lambda_j \|f\|_{\hat{H}_{\gamma_j}(A_j)}^2 + \frac{1}{n} \sum_{i=1}^n L_j(x_i, y_i, f(x_i)) ,$$

and

$$f_{D,\lambda,\gamma} := \sum_{j=1}^m f_{D_j,\lambda_j,\gamma_j}$$

instead of $f_{D_j,\lambda_j}$ and $f_{D,\lambda}$ in the remainder of this work.

In the subsequent section, we consider the least squares loss which, together with Assumption
(G) and Theorem 6, allows us to elaborate the oracle inequality stated in Theorem
5 so that we finally obtain learning rates.
5. Learning Rates for Least Squares VP-SVMs

In this section, the non-parametric least squares regression problem is considered using the
least squares loss $L : Y \times \mathbb{R} \to [0, \infty)$ defined by $L(y, t) := (y - t)^2$. It is well known that,
in this case, the Bayes decision function $f^*_{L,P} : \mathbb{R}^d \to \mathbb{R}$ is given by $f^*_{L,P}(x) = \mathbb{E}_P(Y|x)$ for
$P_X$-almost all $x \in \mathbb{R}^d$. Moreover, this function is unique up to zero-sets. Besides, for the
least squares loss the equality

\[ \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \|f - f^*_{L,P}\|^2_{L_2(P_X)} \]

can be shown by some simple, well-known transformations. Recall that $T$ is a non-empty
subset of $X$, where the index set $J_T$ defined by (9) indicates every set $A_j$ of the partition
of $X$ that shares at least one point with $T$. The associated loss function
$L_{J_T} : X \times Y \times \mathbb{R} \to [0, \infty)$ is defined by (11).


5.1 Basic Oracle Inequalities for LS-VP-SVMs 


To formulate oracle inequalities and derive rates for VP-SVMs using the least squares loss,
the target function $f^*_{L,P}$ is assumed to satisfy certain smoothness conditions. To this end, we
initially recall the modulus of smoothness, a device to measure the smoothness of functions
(see e.g. (DeVore and Lorentz, 1993, p. 44), (DeVore and Popov, 1988, p. 398), and (Berens
and DeVore, 1978, p. 360)). Denote by $\|\cdot\|_2$ the Euclidean norm and let $X \subset \mathbb{R}^d$ be a subset
with non-empty interior, $\nu$ be an arbitrary measure on $X$, $p \in (0, \infty]$, and $f : X \to \mathbb{R}$ be
contained in $L_p(\nu)$. Then, for $s \in \mathbb{N}$, the $s$-th modulus of smoothness of $f$ is defined by

\[ \omega_{s,L_p(\nu)}(f, t) = \sup_{\|h\|_2 \le t} \|\Delta_h^s(f, \cdot)\|_{L_p(\nu)}, \qquad t \ge 0, \]

where $\Delta_h^s(f, \cdot)$ denotes the $s$-th difference of $f$ given by

\[ \Delta_h^s(f, x) = \begin{cases} \sum_{j=0}^s \binom{s}{j} (-1)^{s-j} f(x + jh) & \text{if } x \in X_{s,h}, \\ 0 & \text{if } x \notin X_{s,h}, \end{cases} \]

for $h = (h_1, \ldots, h_d) \in [0, \infty)^d$ and $X_{s,h} := \{x \in X : x + th \in X \text{ f.a. } t \in [0, s]\}$. Based on
the modulus of smoothness, we introduce Besov spaces, i.e. function spaces that provide
a finer scale of smoothness than the commonly used Sobolev spaces and that will thus be
assumed to contain the target function later on. To this end, let $1 \le p, q \le \infty$, $\alpha > 0$,
$s := \lfloor \alpha \rfloor + 1$, and $\nu$ be an arbitrary measure. Then the Besov space $B^\alpha_{p,q}(\nu)$ is defined by

\[ B^\alpha_{p,q}(\nu) := \bigl\{ f \in L_p(\nu) : |f|_{B^\alpha_{p,q}(\nu)} < \infty \bigr\}, \]

where the seminorm $|\cdot|_{B^\alpha_{p,q}(\nu)}$ is given by

\[ |f|_{B^\alpha_{p,q}(\nu)} := \left( \int_0^\infty \bigl( t^{-\alpha} \omega_{s,L_p(\nu)}(f, t) \bigr)^q \frac{dt}{t} \right)^{1/q}, \qquad 1 \le q < \infty, \]

or

\[ |f|_{B^\alpha_{p,\infty}(\nu)} := \sup_{t > 0} \bigl( t^{-\alpha} \omega_{s,L_p(\nu)}(f, t) \bigr), \]
see e.g. (Adams and Fournier, 2003, Section 7) and (Triebel, 2010, Sections 2 and 3).
Note that $\|f\|_{B^\alpha_{p,q}(\nu)} := \|f\|_{L_p(\nu)} + |f|_{B^\alpha_{p,q}(\nu)}$ actually describes a norm of $B^\alpha_{p,q}(\nu)$ for all
$q \in [1, \infty]$, see e.g. (DeVore and Lorentz, 1993, pp. 54/55) and (DeVore and Popov, 1988,
p. 398). Again, if $\nu$ is the Lebesgue measure on $X$, we write $B^\alpha_{p,q}(X) := B^\alpha_{p,q}(\nu)$. For the
sake of completeness, recall from e.g. (Adams and Fournier, 2003, Section 3) and (Triebel,
2010, Sections 2 and 3) the scale of Sobolev spaces $W^\alpha_p(\nu)$ defined by

\[ W^\alpha_p(\nu) := \bigl\{ f \in L_p(\nu) : \partial^{(\beta)} f \in L_p(\nu) \text{ exists for all } \beta \in \mathbb{N}_0^d \text{ with } |\beta| \le \alpha \bigr\}, \]

where $\alpha \in \mathbb{N}_0$, $1 \le p \le \infty$, $\nu$ is an arbitrary measure, and $\partial^{(\beta)}$ is the $\beta$-th weak derivative
for a multi-index $\beta = (\beta_1, \ldots, \beta_d) \in \mathbb{N}_0^d$ with $|\beta| = \sum_{i=1}^d \beta_i$. That is, $W^\alpha_p(\nu)$ is the space
of all functions in $L_p(\nu)$ whose weak derivatives up to order $\alpha$ exist and are contained in
$L_p(\nu)$. Moreover, the Sobolev space is equipped with the Sobolev norm

\[ \|f\|_{W^\alpha_p(\nu)} := \Bigl( \sum_{|\beta| \le \alpha} \bigl\|\partial^{(\beta)} f\bigr\|^p_{L_p(\nu)} \Bigr)^{1/p} \]

(cf. Adams and Fournier, 2003, page 60). We write $W^0_p(\nu) = L_p(\nu)$ and, for the Lebesgue
measure $\mu$ on $X \subset \mathbb{R}^d$, we define $W^\alpha_p(X) := W^\alpha_p(\mu)$. It is well-known, see e.g. (Edmunds
and Triebel, 1996, p. 25 and p. 44), that the Sobolev spaces $W^\alpha_p(\mathbb{R}^d)$ fall into the scale of
Besov spaces, namely

\[ W^\alpha_p(\mathbb{R}^d) \subset B^\alpha_{p,q}(\mathbb{R}^d) \]

for $\alpha \in \mathbb{N}$, $p \in (1, \infty)$, and $\max\{p, 2\} \le q \le \infty$. Moreover, for $p = q = 2$ we actually have
equality, that is $W^\alpha_2(\mathbb{R}^d) = B^\alpha_{2,2}(\mathbb{R}^d)$ with equivalent norms.

Based on the least squares loss and RKHSs using Gaussian kernels over the partition
sets $A_j$, the subsequent theorem refines the oracle inequality stated in Theorem 5.


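Theorem 7 Let $Y := [-M, M]$ for $M > 0$, $L : Y \times \mathbb{R} \to [0, \infty)$ be the least squares loss
and $P$ be a distribution on $\mathbb{R}^d \times Y$. We write $X := \operatorname{supp} P_X$. Furthermore, let (A) and
(G) be satisfied. In addition, for an arbitrary subset $T \subset X$, we assume (T). Moreover,
let $f^*_{L,P} : \mathbb{R}^d \to \mathbb{R}$ be a Bayes decision function such that $f^*_{L,P} \in L_2(\mathbb{R}^d) \cap L_\infty(\mathbb{R}^d)$ as
well as $f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|A_T})$ for some $\alpha \ge 1$. Then, for all $p \in (0,1)$, $n \ge 1$, $\tau \ge 1$,
$\gamma = (\gamma_1, \ldots, \gamma_m) \in (0, r]^m$, and $\lambda = (\lambda_1, \ldots, \lambda_m) > 0$, the VP-SVM given by (4) using
$H_{\gamma_1}(A_1), \ldots, H_{\gamma_m}(A_m)$, and the loss $L_{J_T}$ satisfies

\[ \sum_{j=1}^m \lambda_j \|f_{D_j,\lambda_j,\gamma_j}\|^2_{H_{\gamma_j}(A_j)} + \mathcal{R}_{L_{J_T},P}(f_{D,\lambda,\gamma}) - \mathcal{R}^*_{L_{J_T},P} \]
\[ \le C_{M,\alpha,p} \cdot \left( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \left( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \right)^{d} \max_{j \in J_T} \gamma_j^{2\alpha} + r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^{p} n^{-1} + \tau n^{-1} \right) \]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $C_{M,\alpha,p} > 0$ is a constant only depending on
$M$, $\alpha$, $p$, $d$, $\|f^*_{L,P}\|_{L_2(\mathbb{R}^d)}$, and $\|f^*_{L,P}\|_{L_\infty(\mathbb{R}^d)}$.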
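Using this oracle inequality, we derive learning rates w.r.t. the loss $L_{J_T}$ for the learning
method described by (3) and (4) in the following theorem.


Theorem 8 Let $\tau \ge 1$ be fixed and $\beta \ge \frac{2\alpha}{d} + 1$. Under the assumptions of Theorem 7 and
with

\[ r_n = c_1 n^{-\frac{1}{\beta d}}, \tag{21} \]
\[ \lambda_{n,j} = c_2 r_n^d n^{-1}, \tag{22} \]
\[ \gamma_{n,j} = c_3 n^{-\frac{1}{2\alpha + d}}, \tag{23} \]

for every $j \in \{1, \ldots, m_n\}$, we have, for all $n \ge 1$ and $\xi > 0$,

\[ \mathcal{R}_{L_{J_T},P}(f_{D,\lambda_n,\gamma_n}) - \mathcal{R}^*_{L_{J_T},P} \le C \tau n^{-\frac{2\alpha}{2\alpha + d} + \xi} \]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $\lambda_n := (\lambda_{n,1}, \ldots, \lambda_{n,m_n})$ as well as $\gamma_n :=
(\gamma_{n,1}, \ldots, \gamma_{n,m_n})$ and $C, c_1, c_2, c_3$ are positive constants with $c_3 \le c_1$.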
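To make the parameter choices concrete, the following sketch evaluates them for given $n$, $\alpha$, $\beta$, and $d$. It assumes the choices read $r_n = c_1 n^{-1/(\beta d)}$, $\lambda_{n,j} = c_2 r_n^d n^{-1}$, and $\gamma_{n,j} = c_3 n^{-1/(2\alpha+d)}$, together with $\beta \ge 2\alpha/d + 1$ and $c_3 \le c_1$; the function name and the default constants are hypothetical.

```python
def vp_svm_parameters(n, alpha, beta, d, c1=1.0, c2=1.0, c3=1.0):
    """Radius, regularization parameter and kernel width for sample size n.

    Assumed formulas: r_n = c1 * n^(-1/(beta*d)), lambda_n = c2 * r_n^d / n,
    gamma_n = c3 * n^(-1/(2*alpha + d)).
    """
    if beta < 2.0 * alpha / d + 1.0:
        raise ValueError("beta >= 2*alpha/d + 1 is needed for gamma_n <= r_n")
    if c3 > c1:
        raise ValueError("c3 <= c1 is needed for gamma_n <= r_n")
    r_n = c1 * n ** (-1.0 / (beta * d))
    lambda_n = c2 * r_n ** d / n
    gamma_n = c3 * n ** (-1.0 / (2.0 * alpha + d))
    return r_n, lambda_n, gamma_n
```

Both checks only mirror the stated conditions; with them in place, the returned width never exceeds the returned radius.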

In the latter theorem the condition $\beta \ge \frac{2\alpha}{d} + 1$ is required to ensure $\gamma_{n,j} \le r_n$, $j =
1, \ldots, m_n$, which in turn is a prerequisite arising from Theorem 7 and the used entropy
estimate. Let us briefly examine the extreme case $\beta = \frac{2\alpha}{d} + 1$. Using $r_n \sim n^{-\frac{1}{\beta d}}$ leads
to covering numbers of the form $m_n \sim n^{\frac{d}{2\alpha + d}}$ and computational costs of
$O\bigl(m_n (n/m_n)^q\bigr) = O\bigl(n^{\frac{2\alpha q + d}{2\alpha + d}}\bigr)$,
which is actually less than the computational cost of order $n^q$, $q \in [2, 3]$, of a
usual SVM. Note that for increasing $\beta$ the computational cost of a VP-SVM is increasing
as well. However, for $\beta > \frac{2\alpha}{d} + 1$, $r_n \sim n^{-\frac{1}{\beta d}}$, and $m_n \approx n^{\frac{1}{\beta}}$, a VP-SVM has costs of
$O\bigl(n^{\frac{1 + (\beta - 1) q}{\beta}}\bigr)$, which still is less than $O(n^q)$.
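The cost comparison can be checked with a few lines of arithmetic. The sketch below assumes $m_n \approx n^{1/\beta}$ cells of roughly equal size $n/m_n$ and a solver cost of order $(n/m_n)^q$ per cell, which gives a total cost exponent of $(1 + (\beta - 1)q)/\beta$; the function name is illustrative.

```python
def vp_svm_cost_exponent(beta, q):
    """Exponent e such that the assumed total cost m_n * (n / m_n)^q equals n^e
    when m_n = n^(1/beta)."""
    if beta < 1.0:
        raise ValueError("beta is assumed to be at least 1")
    return (1.0 + (beta - 1.0) * q) / beta
```

For example, $\beta = 2$ and $q = 2$ give the exponent 1.5, compared with the exponent 2 of a global solver under the same cost model.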

Let us finally take a closer look at the VP-SVM given by (4) and the related considerations
on the joined RKHS, where $f_{D,\lambda} \in H = \bigoplus_{j=1}^m H_j$ solves the minimization problem

\[ f_{D,\lambda} = \arg\min_{f = f_1 + \cdots + f_m \in H} \; \sum_{j=1}^m \lambda_j \|f_j\|^2_{H_j} + \mathcal{R}_{L,D}\Bigl( \sum_{j=1}^m f_j \Bigr). \]

Choosing $\lambda_1 = \ldots = \lambda_m$, the VP-SVM problem can be understood as an $\ell_2$-multiple kernel
learning (MKL) problem using the RKHSs $H_1, \ldots, H_m$. Learning rates for MKL have
been treated, for example, in (Suzuki, 2011) and (Kloft and Blanchard, 2012). Assuming
$f^*_{L,P} \in H$, the learning rate achieved in (Suzuki, 2011) is $m n^{-\frac{1}{1+s}}$ for dense settings, where
$s$ is the so-called spectral decay coefficient. In addition, Kloft and Blanchard (2012) obtain
essentially the same rates under these assumptions. Let us therefore briefly investigate the
above rate of (Suzuki, 2011). For RKHSs that are continuously embedded in a Sobolev
space $W^\alpha_2(X)$ we have $s = \frac{d}{2\alpha}$ such that the learning rate reduces to $m n^{-\frac{2\alpha}{2\alpha + d}}$. Note that
this learning rate is $m$ times the optimal learning rate $n^{-\frac{2\alpha}{2\alpha + d}}$, where the number $m = m_n$
of kernels may increase with the sample size $n$. In particular, if $m_n \to \infty$ polynomially,
then the rates obtained in (Suzuki, 2011) become substantially worse than the optimal rate.
In contrast, due to the special choice of the RKHSs, this is not the case for our VP-SVM
problem, provided that $m_n$ does not grow faster than $n^{\frac{d}{2\alpha + d}}$.
Figure 5: An input space $X$ with the corresponding Voronoi partition as well as a subset $T$.

Note that the oracle inequalities and learning rates achieved in Theorems 7 and 8 require
$f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|A_T})$. However, for an increasing sample size $n$, the sets $A_j$ shrink and the
index set $J_T$, indicating every set $A_j$ such that $A_j \cap T \neq \emptyset$ and $T \subset \bigcup_{j \in J_T} A_j$, increases.
In particular, this also involves that the set $\bigcup_{j \in J_T} A_j$ covering $T$ changes in tandem with $n$.
Since this is very inconvenient and since it would be desirable to assume a certain level of
smoothness of the target function on a fixed region for all $n \in \mathbb{N}$, we consider the set $T$
enlarged by a $\delta$-tube. To this end, for $\delta > 0$, we define $T^{+\delta}$ by

\[ T^{+\delta} := \{ x \in X : \exists\, t \in T \text{ such that } \|x - t\|_2 \le \delta \}, \tag{24} \]

which implies $T \subset T^{+\delta} \subset X$, cf. Figure 5. Note that, for every $\delta > 0$, there exists an $n_\delta \in \mathbb{N}$
such that for every $n \ge n_\delta$ the union of all partition sets $A_j$, having at least one common
point with $T$, is contained in $T^{+\delta}$, i.e.

\[ \forall \delta > 0 \;\; \exists n_\delta \in \mathbb{N} \;\; \forall n \ge n_\delta : \bigcup_{j \in J_T} A_j \subset T^{+\delta}, \tag{25} \]

where $J_T := \{ j \in \{1, \ldots, m_n\} : A_j \cap T \neq \emptyset \}$. Collectively, this implies

\[ T \subset \bigcup_{j \in J_T} A_j \subset T^{+\delta} \]

for all $n \ge n_\delta$. Furthermore, since every set $A_j$ is contained in a ball with radius $r_n = c n^{-\frac{1}{\beta d}}$,
the lowest sample size $n_\delta$ in (25) can be determined by choosing the smallest $n_\delta \in \mathbb{N}$ such
that $\delta \ge 2 r_{n_\delta}$, cf. (7), that is

\[ n_\delta = \left\lceil \left( \frac{2c}{\delta} \right)^{\beta d} \right\rceil. \]
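The threshold $n_\delta$ can be computed numerically. The sketch below assumes the radius has the form $r_n = c\, n^{-1/(\beta d)}$ and returns the smallest $n$ with $\delta \ge 2 r_n$; the function name is hypothetical, and the two loops only guard against floating-point rounding at the boundary.

```python
import math


def smallest_sample_size(delta, c, beta, d):
    """Smallest n in N with delta >= 2 * r_n, assuming r_n = c * n^(-1/(beta*d))."""
    if delta <= 0 or c <= 0 or beta <= 0 or d < 1:
        raise ValueError("delta, c, beta must be positive and d >= 1")

    def radius(n):
        return c * n ** (-1.0 / (beta * d))

    n = max(1, math.ceil((2.0 * c / delta) ** (beta * d)))
    while 2.0 * radius(n) > delta:                 # rounded down too far
        n += 1
    while n > 1 and 2.0 * radius(n - 1) <= delta:  # rounded up too far
        n -= 1
    return n
```

For instance, with $c = 1$, $\beta = 2$, $d = 1$ and $\delta = 0.5$ the condition $2 n^{-1/2} \le 0.5$ first holds at $n = 16$.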



This leads to the following corollary where we present an oracle inequality and learning
rates assuming the smoothness level $\alpha$ of the target function on a fixed region.


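Corollary 9 Let $Y := [-M, M]$ for $M > 0$, $L : Y \times \mathbb{R} \to [0, \infty)$ be the least squares loss,
and $P$ be a distribution on $\mathbb{R}^d \times Y$. We write $X := \operatorname{supp} P_X$. Furthermore, let (A) and
(G) be satisfied. In addition, for an arbitrary subset $T \subset X$, we assume (T). Moreover, let
$f^*_{L,P} : \mathbb{R}^d \to \mathbb{R}$ be a Bayes decision function such that $f^*_{L,P} \in L_2(\mathbb{R}^d) \cap L_\infty(\mathbb{R}^d)$ as well as

\[ f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|T^{+\delta}}) \]

for $\alpha \ge 1$ and some $\delta > 0$. Then, for all $p \in (0, 1)$, $n \ge n_\delta$, $\tau \ge 1$, $\gamma = (\gamma_1, \ldots, \gamma_m) \in
(0, r]^m$, and $\lambda = (\lambda_1, \ldots, \lambda_m) > 0$, the VP-SVM given by (4) using $H_{\gamma_1}(A_1), \ldots, H_{\gamma_m}(A_m)$,
and the loss $L_{J_T}$ satisfies

\[ \sum_{j=1}^m \lambda_j \|f_{D_j,\lambda_j,\gamma_j}\|^2_{H_{\gamma_j}(A_j)} + \mathcal{R}_{L_{J_T},P}(f_{D,\lambda,\gamma}) - \mathcal{R}^*_{L_{J_T},P} \]
\[ \le C_{M,\alpha,p} \cdot \left( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \left( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \right)^{d} \max_{j \in J_T} \gamma_j^{2\alpha} + r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^{p} n^{-1} + \tau n^{-1} \right) \]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $C_{M,\alpha,p} > 0$ is the same constant as in
Theorem 7.

Additionally, let $\beta \ge \frac{2\alpha}{d} + 1$ as well as, for every $j \in \{1, \ldots, m_n\}$, $r_n$, $\lambda_{n,j}$, and $\gamma_{n,j}$ be
as in (21), (22), and (23), respectively, where $c_1, c_2, c_3$ are user-specified positive constants
with $c_3 \le c_1$. Then, for all $n \ge n_\delta = \bigl\lceil (2 c_1 / \delta)^{\beta d} \bigr\rceil$ and $\xi > 0$, we have

\[ \mathcal{R}_{L_{J_T},P}(f_{D,\lambda_n,\gamma_n}) - \mathcal{R}^*_{L_{J_T},P} \le C \tau n^{-\frac{2\alpha}{2\alpha + d} + \xi} \]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $\lambda_n := (\lambda_{n,1}, \ldots, \lambda_{n,m_n})$, $\gamma_n := (\gamma_{n,1}, \ldots,
\gamma_{n,m_n})$, and $C$ is a positive constant.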


Note that the assumption $f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|T^{+\delta}})$ made in Corollary 9 is satisfied if, for
example, $f^*_{L,P} \in B^\alpha_{2,\infty}(T^{+\delta})$ and $P_X$ has a bounded Lebesgue density on $T^{+\delta}$. Moreover,
if this density is even bounded away from 0, it is well-known that the minmax rate is
$n^{-\frac{2\alpha}{2\alpha + d}}$ for $\alpha > d/2$ and target functions $f^*_{L,P} \in W^\alpha_2(T)$. Modulo $\xi$, our rate is therefore
asymptotically optimal in a minmax sense on $T$. In addition, for $\alpha > d$, the learning rates
obtained for $f^*_{L,P} \in B^\alpha_{2,\infty}(T)$ are again asymptotically optimal modulo $\xi$ on $T$.


5.2 Data-Dependent Parameter Selection for VP-SVMs 


Note that in the previous theorems the choice of the regularization parameters $\lambda_{n,1}, \ldots,
\lambda_{n,m_n}$ and the kernel widths $\gamma_{n,1}, \ldots, \gamma_{n,m_n}$ requires us to know the smoothness parameter
$\alpha$. Unfortunately, in practice, we usually know neither this value nor whether it exists. In this
subsection, we thus show that a training/validation approach similar to the one examined
in (Steinwart and Christmann, 2008a, Chapters 6.5, 7.4, 8.2) and (Eberts and Steinwart,
2013) achieves the same rates adaptively, i.e. without knowing $\alpha$. For this purpose, let
$\Lambda := (\Lambda_n)$ and $\Gamma := (\Gamma_n)$ be sequences of finite subsets $\Lambda_n \subset (0, r_n^d]$ and $\Gamma_n \subset (0, r_n]$. For a
data set $D := ((x_1, y_1), \ldots, (x_n, y_n))$, we define

\[ D_1 := ((x_1, y_1), \ldots, (x_l, y_l)), \]
\[ D_2 := ((x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)), \]

where $l := \lfloor n/2 \rfloor + 1$ and $n \ge 4$. We further split these sets in data sets

\[ D_j^{(1)} := \bigl( (x_i, y_i) \in D_1 : x_i \in A_j \bigr), \qquad j \in \{1, \ldots, m_n\}, \]
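\[ D_j^{(2)} := \bigl( (x_i, y_i) \in D_2 : x_i \in A_j \bigr), \qquad j \in \{1, \ldots, m_n\}, \]

and define $l_j := |D_j^{(1)}|$ for all $j \in \{1, \ldots, m_n\}$ such that $\sum_{j=1}^{m_n} l_j = l$. For every $j \in
\{1, \ldots, m_n\}$, we basically use $D_j^{(1)}$ as a training set, i.e. based on $D_1$ in combination with
the loss function $L_j := \mathbb{1}_{A_j} L$ we compute SVM decision functions

\[ f_{D_j^{(1)},\lambda_j,\gamma_j} := \arg\min_{f \in H_{\gamma_j}(A_j)} \lambda_j \|f\|^2_{H_{\gamma_j}(A_j)} + \mathcal{R}_{L_j,D_1}(f), \qquad (\lambda_j, \gamma_j) \in \Lambda_n \times \Gamma_n. \]

Again, note that $f_{D_j^{(1)},\lambda_j,\gamma_j} = 0$ if $D_j^{(1)} = \emptyset$. Next, for each $j$, we use $D_2$ in tandem with $L_j$
(or essentially $D_j^{(2)}$) to determine a pair $(\lambda_{D_2,j}, \gamma_{D_2,j}) \in \Lambda_n \times \Gamma_n$ such that

\[ \mathcal{R}_{L_j,D_2}\bigl( f_{D_j^{(1)},\lambda_{D_2,j},\gamma_{D_2,j}} \bigr) = \min_{(\lambda_j,\gamma_j) \in \Lambda_n \times \Gamma_n} \mathcal{R}_{L_j,D_2}\bigl( f_{D_j^{(1)},\lambda_j,\gamma_j} \bigr). \]

Finally, combining the decision functions $f_{D_j^{(1)},\lambda_{D_2,j},\gamma_{D_2,j}}$ for all $j \in \{1, \ldots, m_n\}$, and defining
$\lambda_{D_2} := (\lambda_{D_2,1}, \ldots, \lambda_{D_2,m_n})$ and $\gamma_{D_2} := (\gamma_{D_2,1}, \ldots, \gamma_{D_2,m_n})$, we obtain a function

\[ f_{D_1,\lambda_{D_2},\gamma_{D_2}} := \sum_{j=1}^{m_n} \mathbb{1}_{A_j} f_{D_j^{(1)},\lambda_{D_2,j},\gamma_{D_2,j}} \]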
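and we call every learning method that produces these resulting decision functions
$f_{D_1,\lambda_{D_2},\gamma_{D_2}}$ a training validation Voronoi partition support vector machine (TV-VP-SVM)
w.r.t. $\Lambda \times \Gamma$. Moreover, using (5) we have, for $\lambda := (\lambda_1, \ldots, \lambda_{m_n})$ and $\gamma := (\gamma_1, \ldots, \gamma_{m_n})$,

\[ \mathcal{R}_{L,D_2}\bigl( f_{D_1,\lambda_{D_2},\gamma_{D_2}} \bigr) = \sum_{j=1}^{m_n} \mathcal{R}_{L_j,D_2}\bigl( f_{D_j^{(1)},\lambda_{D_2,j},\gamma_{D_2,j}} \bigr) \]
\[ = \sum_{j=1}^{m_n} \min_{(\lambda_j,\gamma_j) \in \Lambda_n \times \Gamma_n} \mathcal{R}_{L_j,D_2}\bigl( f_{D_j^{(1)},\lambda_j,\gamma_j} \bigr) \]
\[ = \min_{(\lambda,\gamma) \in (\Lambda_n \times \Gamma_n)^{m_n}} \sum_{j=1}^{m_n} \mathcal{R}_{L_j,D_2}\bigl( f_{D_j^{(1)},\lambda_j,\gamma_j} \bigr) \]
\[ = \min_{(\lambda,\gamma) \in (\Lambda_n \times \Gamma_n)^{m_n}} \mathcal{R}_{L,D_2}\bigl( f_{D_1,\lambda,\gamma} \bigr), \]

where $f_{D_1,\lambda,\gamma} := \sum_{j=1}^{m_n} \mathbb{1}_{A_j} f_{D_j^{(1)},\lambda_j,\gamma_j}$ with $(\lambda_j, \gamma_j) \in \Lambda_n \times \Gamma_n$ for all $j \in \{1, \ldots, m_n\}$. In
other words, the function $f_{D_1,\lambda_{D_2},\gamma_{D_2}}$ really minimizes the empirical risk $\mathcal{R}_{L,D_2}$ w.r.t. the
validation data set $D_2$ and the loss $L$, where the minimum is taken over all functions $f_{D_1,\lambda,\gamma}$
with $(\lambda, \gamma) \in (\Lambda_n \times \Gamma_n)^{m_n}$.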
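The key point of this construction, namely that selecting the best pair separately in each cell already minimizes the total validation risk over the full product grid, can be illustrated on precomputed validation risks. The sketch below is only an illustration under that assumption: `val_risk[j]` is a hypothetical dictionary mapping each candidate pair `(lam, gamma)` to the validation risk of cell `j`, and no SVM is trained here.

```python
from itertools import product


def select_per_cell(val_risk):
    """Pick, for each cell, the (lam, gamma) pair with the smallest validation risk.

    Returns the chosen pairs and the resulting total validation risk.
    """
    chosen = [min(risks, key=risks.get) for risks in val_risk]
    total = sum(risks[pair] for risks, pair in zip(val_risk, chosen))
    return chosen, total


def select_jointly(val_risk):
    """Brute-force minimum of the summed validation risk over the product grid."""
    grids = [list(risks) for risks in val_risk]
    return min(
        sum(risks[pair] for risks, pair in zip(val_risk, combo))
        for combo in product(*grids)
    )
```

The per-cell search evaluates a number of candidates proportional to the number of cells times the grid size, whereas the brute-force joint search grows exponentially in the number of cells; both return the same total risk.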

The following theorem presents learning rates for the above described TV-VP-SVM. 


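Theorem 10 Let $r_n := c n^{-\frac{1}{\beta d}}$ with constants $c > 0$ and $\beta > 1$. Under the assumptions
of Theorem 7 we fix sequences $\Lambda := (\Lambda_n)$ and $\Gamma := (\Gamma_n)$ of finite subsets $\Lambda_n \subset (0, r_n^d]$
and $\Gamma_n \subset (0, r_n]$ such that $\Lambda_n$ is an $(r_n^d \varepsilon_n)$-net of $(0, r_n^d]$ and $\Gamma_n$ is a $\delta_n$-net of $(0, r_n]$ with
$\varepsilon_n \le n^{-1}$ and $\delta_n \le n^{-\frac{1}{2+d}}$. Furthermore, assume that the cardinalities $|\Lambda_n|$ and $|\Gamma_n|$ grow
polynomially in $n$. Then, for all $\xi > 0$, $\tau \ge 1$, and $\alpha < \frac{\beta - 1}{2} d$, the TV-VP-SVM producing
the decision functions $f_{D_1,\lambda_{D_2},\gamma_{D_2}}$ satisfies

\[ P^n\Bigl( \mathcal{R}_{L_{J_T},P}\bigl( f_{D_1,\lambda_{D_2},\gamma_{D_2}} \bigr) - \mathcal{R}^*_{L_{J_T},P} \le c \tau n^{-\frac{2\alpha}{2\alpha + d} + \xi} \Bigr) \ge 1 - e^{-\tau}, \]

where $c > 0$ is a constant independent of $n$ and $\tau$.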


Once more, we can replace the assumption $f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|A_T})$ by $f^*_{L,P} \in B^\alpha_{2,\infty}(P_{X|T^{+\delta}})$
for some $\delta > 0$ and obtain the same learning rate as in Theorem 10 for all $n \ge n_\delta$ although
$T^{+\delta}$ is fixed for all $n \in \mathbb{N}$. Note that, if $P_X$ has a Lebesgue density that is bounded
away from 0 and $\infty$ and either $f^*_{L,P} \in W^\alpha_2(T)$ for $\alpha > d/2$ or $f^*_{L,P} \in B^\alpha_{2,\infty}(T)$ for $\alpha > d$,
these learning rates are again asymptotically optimal modulo $\xi$ on $T$ in a minmax sense.
However, the condition $\alpha < \frac{\beta - 1}{2} d$ restricts the set of $\alpha$-values where we obtain learning
rates adaptively. To be more precise, there is a trade-off between $\alpha$ and $\beta$. On the one
hand, for small values of $\beta$ only a small number of possible values for $\alpha$ is covered. On the
other hand, for larger values of $\beta$ the set of $\alpha$-values where we achieve rates adaptively is
increasing but the savings in terms of computing time are decreasing.


6. Experimental Results 

In the previous sections we defined VP-SVMs and derived local learning rates that are 
essentially optimal. So far, it is, however, not clear if the theoretical results suggesting a 
generalization performance not worse than that of global SVMs can be empirically confirmed 
and if the predicted advantages of VP-SVMs in terms of computational costs are preserved in 
practice. Note that the latter is not as obvious as it may seem to be, since VP-SVMs create 
an overhead when generating the working sets, and the working sets themselves do not need 
to be as balanced as we assumed in our naive analysis. In this section, we thus investigate 
the performance of VP-SVMs empirically. Namely, we carry out some experiments using 
the least squares loss with the objective to answer the subsequent questions: 

(1) How do different radii affect the performance of VP-SVMs? In particular, what is the 
impact on the training time and the VP-SVM’s test error? 

(2) How do the VP-LS-SVMs perform compared to the usual LS-SVMs in terms of the test 
error? What is the speed-up? 

(3) How does the performance of VP-SVMs compare to vanilla data splitting approaches
such as random chunking (RC-SVM), in which the data set is divided into a random
partition with equally sized subsets, and the final decision function is the average of
the SVMs computed on each subset?

(4) How does the VP-LS-SVM behave compared to the global LS-SVM, if the regression 
function has interruptions of its smoothness on zero sets? 
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(c) Artificial data Type III: jagged function 

Figure 6: Unsealed basic functions used to generate the artificial data sets. 
To address these questions we utilize two kinds of data sets. On the one hand, to
answer questions (1), (2), and (3), we examine the three real data sets COVTYPE, IJCNN1,
and COD-RNA, which we obtained from LIBSVM's homepage, see (Chang and Lin, 2011).
Table 1 summarizes some characteristics of these data sets. On the other hand, we generated
several artificial data sets to address the last question.

data set type    full data set size    dimension    number of labels
COVTYPE          581 012               54           2
COD-RNA          488 565               8            2
IJCNN1           141 691               22           2

Table 1: Characteristics of the considered LIBSVM data sets.

In order to prepare the data sets
for the experiments, we edited the data sets from LIBSVM in the following manner. If
for a real-world data set type the raw data set was already split, we first merged these
sets so that we obtained one data set for each data set type. In a next step, we scaled
the data componentwise such that all samples including labels lie in $[-1, 1]^{d+1}$, where $d$
is the dimension of the input data. Finally, for each data set type, we generated random
subsets that were afterwards randomly split into a training and a test data set. In this
manner, we obtained, for each of the three LIBSVM data set types, training sets consisting
of $n$ = 1 000, 2 500, 5 000, 10 000, 25 000, 50 000, 100 000 samples. Additionally, for the
data sets COVTYPE and COD-RNA, we created training sets of sizes 250 000 and 500 000,
and of sizes 250 000 and 400 000, respectively. The test data sets associated to the various
training sets consist of $n_{\text{test}}$ = 50 000 random samples, apart from the training sets with
$n_{\text{train}} \le$ 5 000, for which we took $n_{\text{test}}$ = 10 000 test samples.
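The componentwise scaling step can be sketched as follows. The text only states that all samples, including labels, end up in $[-1, 1]^{d+1}$; the min-max rule used below is an assumption, as is the convention of mapping constant columns to 0, and the function name is illustrative.

```python
def scale_componentwise(rows):
    """Min-max scale every column of `rows` (features and label alike) to [-1, 1].

    Each row is a sequence of d feature values followed by the label.
    Constant columns are mapped to 0.0 (an arbitrary convention of this sketch).
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]
```

In this sketch the column minima and maxima are taken over the whole data set before it is split, matching the order of steps described above (scaling first, then drawing training and test subsets).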

For the artificial data, we proceeded in a slightly different way. To generate the data sets,
we took as a basis the five regression functions pictured in Figure 6 and, as noise, the sum
of two uniform distributions on [—c(x), c(x)], where c(x) = j (3sin (f |a:|) -|- l) for the one¬ 
dimensional data sets and c{x) = | (sin (|xi| -|- |x2|)) + l) for the two-dimensional data 
sets. Thus, we produced five different types of artificial data sets, where the various data
set types are named according to their type numbers as in Figure 6. Initially, we created
two sets, namely one training and one test data set, each consisting of 10 000 random input
samples contained in $[-1, 1]$ and $[-1, 1]^2$, respectively. Then, for each artificial data set
type, we determined the labels belonging to the input data as the sum of the corresponding
functional value and the noise and, finally, scaled all 20 000 labels to $[-1, 1]$. In a last
step, we randomly built subsets of the training sets of size $n$ = 1 000, 2 500, 5 000. In this
way, we altogether obtained, for each type of artificial data, four training data sets of size
$n$ = 1 000, 2 500, 5 000, 10 000 and a corresponding test data set of size $n_{\text{test}}$ = 10 000.
Based on the test data sets the Bayes risks can be determined, see Table 2, where the Bayes
risks are summarized for the various artificial data set types.

              Type I    Type II    Type III    Type IV    Type V
Bayes risk    0.0254    0.0137     0.0529      0.0083     0.0634

Table 2: Bayes risks w.r.t. test data sets for the various artificial data set types.
To minimize random effects, we repeated the experiment for each setting several times.
Since experiments using large data sets entail long run times, we reran every experiment
using a training set of size $n \ge$ 50 000 only three times, while for training sets of size
$n$ = 10 000, 25 000 we performed ten repetitions and for smaller training sets, namely of
size $n$ = 1 000, 2 500, 5 000, even 100 runs. An exception are the experiments using artificial
training sets of size $n$ = 10 000, where we realized 100 repetitions for the sake of uniformity.

To approach the above problems we used the least squares loss and Gaussian kernels for
all experiments. We implemented an LS-SVM-solver in C++ similar to the one in (Steinwart
et al., 2011). Around this solver, we then built the routines for the VP-SVM and the
RC-SVM. The three programs were compiled with Linux's gcc. To produce
comparable results in terms of run time, all real-world data experiments were run on the
same professional compute server¹ equipped with four INTEL XEON E7-4830 (2.13 GHz)
8-core processors, 256 GB RAM, and a 64 bit version of Debian GNU/Linux 6.0.7. To make
the run times indeed comparable, we used eight cores to pre-compute the kernel
matrix and to evaluate the final decision functions on the test set, and one core for the
subsequent solver for every real data experiment. Since the artificial data sets consist of at
most 10 000 samples, we performed the corresponding experiments on a computer equipped with
one INTEL CORE i7-3770K (3.50 GHz) quad core processor, 16 GB RAM, and a 64 bit
version of Debian GNU/Linux 6.0.7. For all artificial data experiments we used four cores
to pre-compute the kernel matrix and to evaluate the final decision functions on the test set,
and again one core for the solver. Even with pre-computed kernel matrices, our experiments
on the real-world data altogether required almost 810 hours (approximately 34 days) for
training and additionally almost 4 days for testing. Moreover, the experiments on the
artificial data took nearly 43 hours for training and 168 minutes for testing. Without
pre-computing the kernel matrices, e.g. by applying a standard caching approach, preliminary
experiments suggested that the training time would multiply, which would have rendered the
experiments infeasible. Besides, our experiments will show that the available amount of
RAM does not restrict the size of the training sets used by VP-SVMs as severely as the
ones used by LS-SVMs.

Let us quickly illustrate the routines of the VP- and the RC-SVM implemented around
the LS-solver. For the VP-SVM, we first split the training set by Algorithm 1 into several
working sets representing a Voronoi partition w.r.t. the user-specified radius. For this
purpose, Algorithm 1 initially determines a cover of the input data applying the farthest
first traversal algorithm, see (Dasgupta, 2008) and (Gonzalez, 1985) for more details. Note
that this procedure induces working sets whose sizes may vary considerably. In the
case of an RC-SVM the working sets are created randomly, where their sizes are basically
equal and the number of working sets is predefined by the user. Then, for the VP-SVM- as
well as for the RC-SVM-algorithm the implemented LS-solver is applied on every working
set. For each working set, we randomly split the respective training data set of size $n_{\text{train}}$ in
five folds to apply 5-fold cross-validation in order to deal with the hyper-parameters $\lambda$ and
$\gamma$ taken from a 10 by 10 grid geometrically generated in $[0.001 \cdot n_{\text{train}}^{-1}, 0.1] \times [0.5 \cdot n_{\text{train}}^{-1/d}, 10]$.
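A geometrically generated grid places its points at a constant ratio between neighbours. The sketch below builds such a grid for arbitrary positive endpoints and combines two grids into candidate pairs; the function names are illustrative, and the endpoints are parameters rather than the values used in the experiments.

```python
def geometric_grid(low, high, size):
    """`size` points from `low` to `high` (both included) with a constant ratio."""
    if not 0 < low < high:
        raise ValueError("endpoints must satisfy 0 < low < high")
    if size < 2:
        raise ValueError("at least two grid points are required")
    ratio = (high / low) ** (1.0 / (size - 1))
    return [low * ratio ** k for k in range(size)]


def candidate_pairs(lambda_bounds, gamma_bounds, size=10):
    """All (lambda, gamma) pairs of a size-by-size geometric grid."""
    lambdas = geometric_grid(lambda_bounds[0], lambda_bounds[1], size)
    gammas = geometric_grid(gamma_bounds[0], gamma_bounds[1], size)
    return [(lam, gam) for lam in lambdas for gam in gammas]
```

With `size=10` this yields 100 candidate pairs per working set, each of which would then be scored by 5-fold cross-validation.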


1. On this occasion, we would like to thank the Institute for Applied Analysis and Numerical Simulation of 
the University of Stuttgart, who placed the above mentioned compute server at our disposal and, thus, 
enabled us to realize our experiments on large real-world data sets. In consequence, the overall time 
available for our experiments was limited. 
Algorithm 1 Determine a Voronoi partition of the input data

Require: Input data set $D_X = \{x_1, \ldots, x_n\}$ with sample size $n \in \mathbb{N}$ and some radius $r > 0$.
Ensure: Working sets indicating a Voronoi partition of $D_X$.
    Pick an arbitrary $z \in D_X$
    $\mathrm{Cover}_1 \leftarrow z$
    $\mathrm{WorkingSet}_1 \leftarrow \emptyset$
    $m \leftarrow 1$
    while $\max_{x \in D_X} \|x - \mathrm{Cover}\|_2 > r$ do
        $z \leftarrow \arg\max_{x \in D_X} \|x - \mathrm{Cover}\|_2$
        $m \leftarrow m + 1$
        $\mathrm{Cover}_m \leftarrow z$
        $\mathrm{WorkingSet}_m \leftarrow \emptyset$
    end while
    for $i = 1$ to $n$ do
        $k \leftarrow \arg\min_{j \in \{1, \ldots, m\}} \|x_i - \mathrm{Cover}_j\|_2$
        $\mathrm{WorkingSet}_k \leftarrow \mathrm{WorkingSet}_k \cup \{x_i\}$
    end for
    return $\mathrm{WorkingSet}_1, \ldots, \mathrm{WorkingSet}_m$
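Algorithm 1 can be written compactly in Python. The following is a minimal sketch rather than the authors' C++ routine: it takes the first point as the "arbitrary" starting centre, interprets the distance of a point to the cover as the distance to its nearest centre, and returns index lists instead of point sets.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def voronoi_working_sets(points, radius):
    """Farthest-first traversal until every point lies within `radius` of a
    centre, followed by assignment of each point to its nearest centre.

    Returns the list of centres and, per centre, the indices of its points.
    Assumes `points` is non-empty.
    """
    centres = [points[0]]
    # gap[i] is the distance of points[i] to the current cover (nearest centre).
    gap = [euclidean(p, centres[0]) for p in points]
    while max(gap) > radius:
        farthest = max(range(len(points)), key=gap.__getitem__)
        centres.append(points[farthest])
        gap = [min(g, euclidean(p, points[farthest])) for g, p in zip(gap, points)]
    working_sets = [[] for _ in centres]
    for i, p in enumerate(points):
        nearest = min(range(len(centres)), key=lambda j: euclidean(p, centres[j]))
        working_sets[nearest].append(i)
    return centres, working_sets
```

Keeping the running distances in `gap` means that each new centre costs one pass over the data, so the traversal needs about $n \cdot m$ distance evaluations for $m$ centres.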


Finally, we obtain one decision function for each working set. To further process these
decision functions, the VP-SVM-algorithm picks exactly one decision function depending
on the working set affiliation of the input value. On the contrary, the RC-SVM-algorithm
simply takes the average of all the decision functions. Moreover, since we scaled the labels of
all data sets to $[-1, 1]$, the computed decision functions are clipped at $\pm 1$. Altogether, note
that the usual LS-SVM-algorithm can be interpreted as a special case of both the VP-SVM-
and the RC-SVM-algorithm using one working set.
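The two ways of combining the local decision functions can be sketched as follows. The decision functions are passed in as plain callables, the working set affiliation of a test point is taken to be its nearest centre, and clipping is applied to the final prediction; the last two points are assumptions of this sketch, as are all names.

```python
def clip(value, bound=1.0):
    """Clip a prediction to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def predict_vp(x, centres, decision_functions, distance):
    """VP-SVM: evaluate only the decision function of the cell containing x,
    where the cell is identified by the nearest centre."""
    cell = min(range(len(centres)), key=lambda j: distance(x, centres[j]))
    return clip(decision_functions[cell](x))


def predict_rc(x, decision_functions):
    """RC-SVM: average the decision functions of all random chunks."""
    mean = sum(f(x) for f in decision_functions) / len(decision_functions)
    return clip(mean)
```

With a single working set both functions reduce to evaluating one clipped decision function, in line with the remark that the usual LS-SVM is a special case of both approaches.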

The experimental results for the three real data sets are summarized in Tables 3 to 6.
These tables, as well as the tables containing the results for the experiments on the
artificial data sets, can be found in the Appendix. In addition to the average run times
of the training and test phases, these tables reflect inter alia the average test errors of
the empirical SVM solutions. Additionally, the $L_2$-errors of the empirical SVM solutions,
i.e. the value of

\[ \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \bigl( f_{D,\lambda,\gamma}(x_{\text{test},i}) - f^*_{L,P}(x_{\text{test},i}) \bigr)^2, \]

is determined for the artificial data sets. Moreover, note that some of the result tables are
incomplete for very large real-world training data sets. In these cases, the kernel matrix,
whose size depends on the training set size, did not fit into the RAM of the used computer
and, thus, these experiments were left out.
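For the artificial data, where the regression function is known, such an error can be computed directly on the test inputs. The sketch below assumes the error is the mean squared difference between the learned decision function and the regression function over the test inputs; whether the reported values include an additional square root cannot be recovered from the text, and the function name is illustrative.

```python
def empirical_l2_error(decision_function, regression_function, test_inputs):
    """Mean squared difference between the two functions on the test inputs."""
    if not test_inputs:
        raise ValueError("at least one test input is required")
    return sum(
        (decision_function(x) - regression_function(x)) ** 2 for x in test_inputs
    ) / len(test_inputs)
```

Unlike the test error, this quantity is not affected by the label noise, so it tends to 0 rather than to the Bayes risk when the decision function approaches the regression function.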


6.1 Experiments on Real-World Data 

In this subsection, we address questions (1), (2), and (3) by examining the results for the
real-world data sets COVTYPE, COD-RNA, and IJCNN1, which are compiled in Figures 7–9
and Tables 3–6.
6.1.1 Comparison of VP-SVMs Using Different Radii


In the following, we focus on the VP-SVMs using four different radii for the various
real-world data sets, where the experimental results are summarized in Tables 3–6 as well as in
Subfigures (d)–(f) of Figures 7–9. Examining the achieved training times for each data set
type, we observe that, for increasing training set sizes, the radius that leads to the shortest
training time typically decreases. More precisely, for the real data sets with sample size
$n_{\text{train}} \ge$ 10 000, the VP-SVMs using the smallest radius always train fastest, while for the
data sets with $n_{\text{train}} <$ 10 000, we cannot make a uniform statement. Clearly, this finding
is not surprising, since an SVM for a small data set trains considerably faster than an SVM
for a large data set, such that splitting the large data set and running an SVM for each of
the small data sets may altogether still be faster. Recall additionally the considerations in
terms of the computational cost made in Section 2.


Let us now consider the VP-SVM results in terms of the realized test errors. As expected,
for the real-world data sets, the test errors achieved by VP-SVMs with fixed radii decrease
with increasing sample size of the used data sets, with only two exceptions. In addition, for the real data
sets COVTYPE and COD-RNA, the test errors decrease for increasing radii, cf. Figures 7(f) and
8(f). Here, however, the test errors achieved for the various radii get close to each other with
increasing training sample size. The same behavior of the test errors appears for the IJCNN1
data set, though, for $n_{\text{train}} \ge$ 5 000, both intermediate radii yield even smaller empirical
risks than the largest radius, see Figure 9(f). In consequence, it is not straightforward to
draw any conclusion on the relation between radius and test error. Nonetheless, we can
say that VP-SVMs using small radii enjoy test errors that are never significantly larger and
sometimes even smaller than those of VP-SVMs using the largest of the applied radii.


Besides, Tables 3 and 5 or Figures 7(e)–(f) and 8(e)–(f) contain an additional finding. For
large data sets, namely for the COVTYPE data set of size 500 000 or for the COD-RNA data
sets of size 250 000 and 400 000, the VP-SVMs with large radii did not yield any solution,
since they failed due to the memory limits of the used computer. More
precisely, in these cases, there was at least one working set such that its kernel matrix did
not fit into the RAM any more. Fortunately, the working sets of VP-SVMs using smaller
radii were small enough such that we still received an outcome. What is more, these
VP-SVMs yielded a better empirical risk in partially less training time compared to VP-SVMs
with large radii and training sample sizes that still allowed a successful run. That
is, using a small radius for the VP-SVM and a training set that is oversized for VP-SVMs
with a larger radius reduces the test error. More precisely, a large training set is crucial for
a small empirical risk, where the possibly arising computational restrictions can be eluded
by a VP-SVM with an appropriate radius.


All in all, localized SVMs using some small radius lead in substantially less training
time to either negligibly worse or even better test errors than VP-SVMs with large radii, if
the training sample size is adequate, i.e. $n_{\text{train}} \ge$ 5 000. In addition, the real data sets with
a large sample size demonstrate that VP-SVMs with small radii are able to overcome the
technical restrictions caused by the used computer and thus yield a better empirical risk
than VP-SVMs with bigger radii can attain at all.
6.1.2 Comparing VP-SVMs with Global LS-SVMs

In the following, we compare the results of the VP-SVM using different radii to the standard
LS-SVM. For the real-world data sets COD-RNA and IJCNN1, the VP-SVMs, based on the
largest of the applied radii, use only one working set. Thus, they coincide with the standard
LS-SVM modulo different values generated by the random number generator. To verify this
fact, we compare for the real data sets COD-RNA and IJCNN1 the results of the VP-SVMs
using one working set to the results of the standard LS-SVMs, see Tables 5 and 6. Here, we
note that the LS-SVM test errors typically decrease with increasing training sample size.
The same holds for the VP-SVMs using one working set. Moreover, the latter VP-SVMs and
the LS-SVMs perform equally well in terms of training time and empirical risk; however,
for $n_{\text{train}} \ge$ 25 000 the VP-SVMs train slower.

In practice, the run time required by an algorithm is a crucial issue.
Hence, for each data set type, we compare hereafter the LS-SVM to the VP-SVM that
trains fastest for the largest training data set. The required average training times and
the average test errors of these SVMs are illustrated in Subfigures (g) - (i) of Figures 7 - 9.
First, we notice that the selected VP-SVM uses the smallest of the applied radii for each
data set type. Besides, the LS-SVM’s test errors are lower than those of the VP-SVMs.
However, with increasing training set size the VP-SVM’s test errors get close to the ones
of the LS-SVM. Moreover, for the IJCNN1 data set of size 100 000, their empirical risks
even coincide, cf. Figure 9(i). Besides, the VP-SVMs train considerably faster than the
LS-SVMs. In particular, for ntrain = 100 000 the VP-SVMs require at most 8.5% of the
LS-SVM’s training times, see Figures 7(h), 8(h) and 9(h). Finally, recall that, for data sets
of size ntrain > 250 000, the LS-SVM problem is infeasible with our computer, just like the
VP-SVMs using the largest of the applied radii. In contrast, for ntrain > 250 000, VP-SVMs
using small radii usually train considerably faster and achieve lower test errors than the
LS-SVMs for ntrain = 100 000, cf. Figures 7(h)-(i) and 8(h)-(i).


Concluding, we have seen that the application of a VP-SVM using a small radius instead 
of the standard LS-SVM reduces the run time considerably entailing at most a negligible 
worsening or even an improvement of the test errors. Moreover, applying VP-SVMs with 
sufficiently small radii enables us to use large data sets and, thus, to elude the computational 
restrictions to sufficiently small data sets. As a result, handling really large data sets with 
the help of suitable VP-SVMs can lead to significantly improved test errors compared to 
an LS-SVM setting with memory constraints. 


6.1.3 Comparison of VP-SVMs with RC-SVMs 

First of all, let us investigate the RC-SVM results that are compiled in Tables HHS as well
as in Subfigures (a) - (c) of Figures 7 - 9. For the real data sets COVTYPE we considered ten,
for the data sets COD-RNA nine, and for the data sets IJCNN1 eight different numbers of
working sets. In each case, we started with an RC-SVM using one working set, i.e. with an
RC-SVM that corresponds to the global LS-SVM modulo different values generated by the
random generator, cf. Tables [5] and [H]. Comparing for every data set the RC-SVMs using
various numbers of working sets, we observe that the number of working sets minimizing
the RC-SVM’s training time increases in tandem with the sample size. Moreover, the
RC-SVM using one working set never trains fastest compared to the other RC-SVMs using
more than one working set. Furthermore, the average test errors for the applied RC-SVMs
usually decrease for a decreasing number of working sets and, hence, are minimized by the
smallest possible number of working sets. Of course, all these findings are not surprising,
since RC-SVMs are typically used to reduce the training time.


Let us now compare the results of VP- and RC-SVMs using roughly the same number
of working sets, cf. Tables [HHEl. Note first that, even though we consider VP- and
RC-SVMs based on the same number of working sets, the RC-SVM working sets are about the
same size whereas the sizes of the VP-SVM working sets may vary over a large range.
That is, the VP-SVMs often deal with a few substantially larger working sets than the
RC-SVMs. Consequently, the RC-SVMs often perform faster than the VP-SVMs, which
require up to five times the RC-SVM’s training time for ntrain = 100 000. In contrast, the
average empirical risks achieved by the VP-SVMs are substantially lower than those of the
RC-SVMs. Besides, in a few cases the VP-SVMs possess at least one working set which is
oversized for the computer’s RAM, so that these VP-SVM problems are infeasible, whereas
the comparable RC-SVMs avoid this conflict. Here, consider e.g. the RC-SVM using seven
working sets and the VP-SVM with radius r = 4 for the COVTYPE data set of size 500 000.
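
The size imbalance between the two kinds of working sets can be made concrete with a small sketch. The Python snippet below is only an illustration under stated assumptions: the one-dimensional toy data, the two centers, and the helper names `random_chunks` and `voronoi_cells` are hypothetical and are not taken from the experiments. It contrasts an equal-sized random split with a nearest-center (Voronoi-type) assignment.

```python
import random


def random_chunks(n, m, seed=0):
    """Split the indices 0..n-1 into m random chunks of (almost) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::m] for j in range(m)]


def voronoi_cells(points, centers):
    """Assign every point to its nearest center (squared Euclidean distance)."""
    cells = [[] for _ in centers]
    for i, x in enumerate(points):
        dists = [sum((a - b) ** 2 for a, b in zip(x, z)) for z in centers]
        cells[dists.index(min(dists))].append(i)
    return cells


rng = random.Random(1)
# Hypothetical toy data: 900 points cluster near 0 and 100 points near 1.
points = [(rng.gauss(0.0, 0.1),) for _ in range(900)]
points += [(rng.gauss(1.0, 0.1),) for _ in range(100)]
centers = [(0.0,), (1.0,)]  # assumed centers, chosen by hand

rc_sizes = [len(c) for c in random_chunks(len(points), 2)]
vp_sizes = [len(c) for c in voronoi_cells(points, centers)]
print("random chunk sizes:", rc_sizes)
print("Voronoi cell sizes:", vp_sizes)
```

In such a setting the random chunks are balanced by construction, while the largest spatial cell follows the data density and can dominate both the training time and the memory footprint.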

In Section 6.1.2, we compared for each data set type the LS-SVM with the VP-SVM
that trains fastest for the largest training data set. Here, we additionally compare this
VP-SVM to the RC-SVMs. To be able to draw a fair comparison in terms of the achieved test
errors, we choose those RC-SVMs that train roughly as fast as the VP-SVM for the largest
training set, i.e. the slowest RC-SVM training faster and the fastest RC-SVM training
slower than the above VP-SVM. Subfigures (g) - (i) of Figures 7 - 9 illustrate the average
training times and the average test errors of these RC-SVMs, the above VP-SVM, and the
LS-SVM. Considering the RC-SVMs, the faster of the two requires for ntrain = 100 000
between 51% and 83% of the VP-SVM’s training time and trains at most seven minutes
faster than the VP-SVM. However, at least for ntrain > 5 000, both considered RC-SVMs
induce substantially higher test errors than the VP- and LS-SVM. Finally, note that
VP-SVMs for ntrain > 250 000 considerably outperform LS-SVMs for ntrain = 100 000, while
RC-SVMs for ntrain > 250 000 lead to even worse test errors than the considered LS-SVMs.


Summarizing, we record that RC-SVMs using as few working sets as possible achieve
the smallest RC-SVM test errors; however, those using more working sets perform faster.
Furthermore, compared to VP-SVMs using roughly the same number of working sets, the
RC-SVMs may learn faster, though not as well as the VP-SVMs. Moreover,
considering RC-SVMs that require roughly the same training time as the fastest VP-SVM,
we saw that the RC-SVMs lead to much higher empirical risks. That is, if the required
training time is a hard constraint, then the VP-SVM that satisfies this constraint achieves
a better test error than an RC-SVM that also trains fast enough.


6.2 Experiments on Artificial Data 

It remains to address the last question. To this end, we consider the results on the various
artificial data sets, on the one hand, for the LS-SVM and, on the other hand, for the
VP-SVM performing fastest for ntrain = 10 000. Moreover, for the sake of comparability, we
again add to this selection the two RC-SVMs training roughly as fast as the VP-SVM for
the artificial data sets of size ntrain = 10 000. However, for the artificial data sets of Type I,
II, and III, none of the executed RC-SVMs trained faster for ntrain = 10 000 than the
VP-SVM with the smallest radius, so that we only consider one RC-SVM in these cases. The
required average training times and average test errors of the selected SVMs are illustrated
in Subfigures (g) - (i) of Figures 10 - 14 and summarized in Tables EHni. Here, we note that,
for the artificial data of Type I, II, and III, the VP-SVM using the smallest of the applied
radii trains fastest for ntrain = 10 000, while for the artificial data of Type IV and V it is
the VP-SVM using the second smallest radius.

[Figure 7, panels plotted against the sample size: (a), (b) average training time of the
various RC-SVMs for ntrain < 10 000 and ntrain > 5 000; (c) average empirical risk of the
various RC-SVMs; (d), (e) average training time of the various VP-SVMs for ntrain < 10 000
and ntrain > 5 000; (f) average empirical risk of the various VP-SVMs; (g), (h) average
training time of LS-, VP-, and RC-SVMs for ntrain < 10 000 and ntrain > 5 000; (i) average
empirical risk of LS-, VP-, and RC-SVMs.]

Figure 7: Average training time and test error of LS-, VP-, and RC-SVMs for the real-world data
COVTYPE depending on the training set size ntrain = 1 000,..., 500 000. Subfigures (a) - (c) show
the results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate
the results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and two RC-SVMs. Here,
the VP-SVM is the one which trains fastest for ntrain = 500 000 and the two RC-SVMs are those
which achieve for ntrain = 500 000 roughly the same training time as the chosen VP-SVM. Here,
note that, for ntrain = 10 000, the RC-SVM using one working set trains substantially slower than
the LS-SVM, even though this RC-SVM is basically an LS-SVM. As a reason for this phenomenon,
we conjecture that the used compute server was busy because of other influences.

[Figure 8: panels (a) - (i) as in Figure 7.]

Figure 8: Average training time and test error of LS-, VP-, and RC-SVMs for the real-world data
COD-RNA depending on the training set size ntrain = 1 000,..., 400 000. Subfigures (a) - (c) show
the results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate
the results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and two RC-SVMs. Here,
the VP-SVM is the one which trains fastest for ntrain = 400 000 and the two RC-SVMs are those
which achieve for ntrain = 400 000 roughly the same training time as the chosen VP-SVM.

[Figure 9: panels (a) - (i) as in Figure 7.]

Figure 9: Average training time and test error of LS-, VP-, and RC-SVMs for the real-world data
IJCNN1 depending on the training set size ntrain = 1 000,..., 100 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and two RC-SVMs. Here,
the VP-SVM is the one which trains fastest for ntrain = 100 000 and the two RC-SVMs are those
which achieve for ntrain = 100 000 roughly the same training time as the chosen VP-SVM.

[Figure 10, panels plotted against the sample size (1000, 2500, 5000, 10000): (a) average
training time, (b) average empirical risk, and (c) average empirical L2-error of the various
RC-SVMs; (d) - (f) the same quantities for the various VP-SVMs; (g) - (i) the same quantities
for LS-, VP-, and RC-SVMs.]

Figure 10: Average training time and test error of LS-, VP-, and RC-SVMs for the artificial data
Type I depending on the training set size ntrain = 1 000,..., 10 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and one RC-SVM. Here,
the VP-SVM and the RC-SVM are those which train fastest for ntrain = 10 000. Note that in
the case at hand none of the considered RC-SVMs performs faster than the fastest VP-SVM for
ntrain = 10 000.

[Figure 11: panels (a) - (i) as in Figure 10.]

Figure 11: Average training time and test error of LS-, VP-, and RC-SVMs for the artificial data
Type II depending on the training set size ntrain = 1 000,..., 10 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and one RC-SVM. Here,
the VP-SVM and the RC-SVM are those which train fastest for ntrain = 10 000. Note that in
the case at hand none of the considered RC-SVMs performs faster than the fastest VP-SVM for
ntrain = 10 000.

[Figure 12: panels (a) - (i) as in Figure 10.]

Figure 12: Average training time and test error of LS-, VP-, and RC-SVMs for the artificial data
Type III depending on the training set size ntrain = 1 000,..., 10 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and one RC-SVM. Here,
the VP-SVM and the RC-SVM are those which train fastest for ntrain = 10 000. Note that in
the case at hand none of the considered RC-SVMs performs faster than the fastest VP-SVM for
ntrain = 10 000.

[Figure 13: panels (a) - (i) as in Figure 10.]

Figure 13: Average training time and test error of LS-, VP-, and RC-SVMs for the artificial data
Type IV depending on the training set size ntrain = 1 000,..., 10 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and two RC-SVMs. Here,
the VP-SVM is the one which trains fastest for ntrain = 10 000 and the two RC-SVMs are those
which achieve for ntrain = 10 000 roughly the same training time as the chosen VP-SVM.

[Figure 14: panels (a) - (i) as in Figure 10.]

Figure 14: Average training time and test error of LS-, VP-, and RC-SVMs for the artificial data
Type V depending on the training set size ntrain = 1 000,..., 10 000. Subfigures (a) - (c) show the
results for RC-SVMs using different numbers of working sets and Subfigures (d) - (f) illustrate the
results for VP-SVMs using various radii. At the bottom, Subfigures (g) - (i) contain the average
training times and the average test errors of the LS-SVM, one VP-SVM and two RC-SVMs. Here,
the VP-SVM is the one which trains fastest for ntrain = 10 000 and the two RC-SVMs are those
which achieve for ntrain = 10 000 roughly the same training time as the chosen VP-SVM.

[Figure 15, legend: Bayes decision function; average empirical LS-SVM solution; average
empirical VP-SVM solution; average empirical RC-SVM solution.]

Figure 15: Predictions for the artificial data sets of Type I, drawn from the step function in Figure
6(a) with noise depending on x. The left graphic shows the predictions for the data set of size
ntrain = 1 000 and the right graphic for the data set of size ntrain = 10 000. Here, note that the
VP-SVM solutions are not necessarily continuous; nevertheless, we continuously connected their
predicted values in the above plots.

Expectedly, we detect an evident improvement of the various average empirical SVM 
solutions using 10000 training samples instead of 1000 samples. Besides, the considered 
VP-SVM trains substantially faster than the standard LS-SVM with less than 11% of the 
LS-SVM’s training time for ntrain = 10 000. Additionally, the VP-SVM’s test errors are 
usually considerably lower than the test errors of the LS-SVM. Regarding the test errors of 
the RC-SVMs, we note that, in the majority of cases, they are higher than the VP-SVM’s 
and the LS-SVM’s test errors. 

So far, we examined the behavior of LS-, VP-, and RC-SVMs in terms of training time
and test error. Let us finally compare the three different kinds of SVMs w.r.t. their visual
appearance. To this end, the average empirical SVM solutions are plotted in Figures 15 - 20
for the different artificial data sets of size ntrain = 1 000 and 10 000. Here, note that, for
the artificial data of Type IV and V, we do not consider both RC-SVMs training roughly
as fast as the selected VP-SVM but only the one of the two RC-SVMs with the lower test
error.

The observation that, for the artificial data of Type I, II, and III, the VP-SVMs perform
best is reinforced by the average empirical VP-SVM solutions illustrated in Figures 15 - 18.
More precisely, Figure 15 shows that only the VP-SVMs exhaust the widths of the steps
of $f^*_{L,P}$ almost completely. Moreover, in Figure 16 the smoothness interruptions of
$f^*_{L,P}$ are again best illustrated by the VP-SVMs, which becomes even more evident in
Figure 17. Besides, Figure 18 illustrates that the peaks of $f^*_{L,P}$ are best reproduced
by the VP-SVMs.

[Figures 16 - 18, legend: Bayes decision function; average empirical LS-SVM solution; average
empirical VP-SVM solution; average empirical RC-SVM solution.]

Figure 16: Predictions for the artificial data sets of Type II, drawn from the cracked function in
Figure 6(b) with noise depending on x. The left graphic shows the predictions for the data set of
size ntrain = 1 000 and the right graphic for the data set of size ntrain = 10 000.

Figure 17: Predictions for the artificial data sets of Type II. The left graphic shows the predictions
for $x \in [-0.55, -0.4]$ and the data set of size ntrain = 1 000, while the graphic on its right-hand side
pictures the predictions for the same interval for x and the data set of size ntrain = 10 000. The
two graphics on the right-hand side illustrate the predictions for $x \in [0.3, 0.6]$, the upper one for the
data set of size ntrain = 1 000 and the lower one for the data set of size ntrain = 10 000.

Figure 18: Predictions for the artificial data sets of Type III, drawn from the jagged function in
Figure 6(c) with noise depending on x. The left graphic shows the predictions for the data set of
size ntrain = 1 000 and the right graphic for the data set of size ntrain = 10 000.


Considering the LS- and the RC-SVMs, we cannot draw a universally valid conclusion
as to which one performs worse. In particular, Figures 15 and 16 show that, for the artificial
data sets of Type I and II, both the average empirical LS- and RC-SVM solutions do not
fit the Bayes decision function very well. Considering the data sets of Type III, the
LS-SVMs dominate the RC-SVMs in terms of the test errors, though both kinds of
SVMs do not reproduce the peaks of $f^*_{L,P}$, especially for small values of |x|.

It remains to visually analyze the results of the two-dimensional data sets.
For the artificial data sets of Type IV, the VP-SVM using 10 000 training samples
achieves the best test error. Moreover, judging by the visual impression, this VP-SVM is the
only one of the considered SVM types that reflects the circular steps of the Bayes decision
function as in Figure 6(d), cf. Figure 19. Finally, for the data sets of Type V, it is always
the LS-SVM which performs best in terms of the test errors, cf. Table 11. This observation
is also substantiated visually. To be more precise, for ntrain = 1 000, the uneven average
empirical decision function induced by the VP-SVM (cf. Figure 20) shows that the RC-SVM
even performs better than the VP-SVM. However, for ntrain = 10 000, the VP-SVM results
are substantially improved such that the RC-SVM is now outperformed by the VP-SVM.

To recapitulate, we see that the VP-SVMs possess the most distinctive ability to handle
smoothness interruptions of the Bayes decision function in most of our artificial data cases,
especially if ntrain = 10 000. For the sake of completeness, we point out that the worst
performance was induced by the RC-SVMs in almost all cases, in particular for a training
sample size of 10 000.

6.3 Conclusions 

Finally, we summarize the essential findings of the previous subsections, where we considered
standard LS-SVMs and two kinds of localized SVMs, namely VP-SVMs and RC-SVMs. As
just analyzed in Subsection 6.2, VP-SVMs have the evident advantage that they manage
smoothness interruptions of the Bayes decision function better than LS- and RC-SVMs.

[Figure 19, panels: (a) LS-SVM using 1 000 training samples; (b) LS-SVM using 10 000 training
samples; (c) VP-SVM using radius r = 0.5 and 1 000 training samples; (d) VP-SVM using radius
r = 0.5 and 10 000 training samples; (e) RC-SVM using 20 working sets and 1 000 training
samples; (f) RC-SVM using 20 working sets and 10 000 training samples.]

Figure 19: Predictions for the artificial data sets of Type IV, drawn from the circular step function
in Figure 6(d) with noise independent of x.

[Figure 20, panels: (a) LS-SVM using 1 000 training samples; (b) LS-SVM using 10 000 training
samples; (c) VP-SVM using radius r = 0.5 and 1 000 training samples; (d) VP-SVM using radius
r = 0.5 and 10 000 training samples; (e) RC-SVM using 15 working sets and 1 000 training
samples; (f) RC-SVM using 15 working sets and 10 000 training samples.]

Figure 20: Predictions for the artificial data sets of Type V, drawn from the 2-dimensional Euclidean
norm in Figure 6(e) with noise independent of x.

The real-world data sets demonstrated that the RC-SVMs perform considerably worse
than the LS-SVMs and the VP-SVMs, while the performance of VP-SVMs using small
radii improves with increasing sample sizes. To be more precise, VP-SVMs outperform
LS-SVMs or at most lead to a negligible worsening compared to LS-SVMs, for a fraction
of the training time and without memory constraints on the large data sets. For very
small data sets, however, LS-SVMs actually train faster than VP-SVMs and, hence, are
preferable. What is more, for data sets of size ntrain < 2 500 all LS-SVMs require less than
9s to train, so that there are probably no reasons to apply a VP-SVM. Besides, really small
training sample sizes involve considerably smaller working sets for a VP-SVM using a small
radius, so that it is hard to find a well-suited prediction.

Furthermore, despite a faster training procedure, a VP-SVM using a sufficiently small
radius induces considerably lower test errors for sample sizes ntrain > 100 000 than an
LS-SVM for training data sets that still enable computational feasibility.


7. Proofs 


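This section is dedicated to proving the results of the previous sections. We begin with the
proof of Lemma 1, relating the radius r of a cover $B_r(z_1), \ldots, B_r(z_m)$ of X defined by Q
with the number m of centers $z_1, \ldots, z_m$.

Proof [Proof of Lemma 1] First of all, let us recall the m-th entropy number of X defined
by

$$\varepsilon_m(X) := \inf\Bigl\{\varepsilon > 0 : \exists z_1, \ldots, z_m \in X \text{ such that } X \subset \bigcup_{i=1}^{m} \bigl(z_i + \varepsilon B_{\ell_2^d}\bigr)\Bigr\}.$$

Since $X \subset cB_{\ell_2^d}$, the m-th entropy number of X can be upper bounded by

$$\varepsilon_m(X) \le 2\varepsilon_m\bigl(cB_{\ell_2^d}\bigr) \le 2c\,\varepsilon_m\bigl(B_{\ell_2^d}\bigr).$$

Additionally, we know by (Carl and Stephani, 1990, Section 1.1) that

$$m^{-\frac{1}{d}} \le \varepsilon_m\bigl(B_{\ell_2^d}\bigr) \le 4m^{-\frac{1}{d}},$$

so that we can find a cover of $X \subset cB_{\ell_2^d}$ satisfying

$$r \le 8c\,m^{-\frac{1}{d}}.$$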
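
To get a feeling for the bound $r \le 8c\,m^{-1/d}$, the following Python sketch evaluates it numerically and inverts it to obtain a sufficient number of centers for a prescribed radius. The function names `radius_bound` and `centers_needed` are hypothetical helpers introduced only for this illustration.

```python
import math


def radius_bound(c, m, d):
    """Upper bound 8 c m^(-1/d) on the radius of a cover with m centers."""
    return 8.0 * c * m ** (-1.0 / d)


def centers_needed(c, r, d):
    """Smallest m with 8 c m^(-1/d) <= r, i.e. m >= (8 c / r)^d."""
    return math.ceil((8.0 * c / r) ** d)


# Example: X contained in the unit ball (c = 1) of dimension d = 2.
print(radius_bound(1.0, 16, 2))
print(centers_needed(1.0, 2.0, 2))
```

The inversion shows that the number of centers guaranteed to suffice grows like $(8c/r)^d$, that is, exponentially in the dimension d for a fixed radius.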


7.1 Proofs of Section 3

In Section 3 we presented a lemma that related the risk w.r.t. the loss L to the risk w.r.t. the
restricted loss $L_j$ and also transferred this result to the excess risk. The proof of this
lemma is given below.

Proof [Proof of Lemma 0] Simple transformations using $A \cup B = X$ and $A \cap B = \emptyset$ show

$$\mathcal{R}_{L,P}(f) = \int_{X \times Y} L\bigl(x, y, \mathbb{1}_A(x) f_A(x) + \mathbb{1}_B(x) f_B(x)\bigr) \, dP(x, y)$$

$$= \int_{X \times Y} \mathbb{1}_A(x) L\bigl(x, y, f_A(x)\bigr) + \mathbb{1}_B(x) L\bigl(x, y, f_B(x)\bigr) \, dP(x, y)$$

$$= \mathcal{R}_{L_A,P}(f_A) + \mathcal{R}_{L_B,P}(f_B).$$

The second assertion follows immediately.


To derive the new oracle inequality of Theorem 5, we first have to relate the entropy
numbers of $H_j$, $j \in \{1, \ldots, m\}$, to those of H. To this end, we consider a concept similar
to entropy numbers, namely covering numbers, cf. (Györfi et al., 2002, Definition 9.3) or
(Steinwart and Christmann, 2008a, Definition 6.19).

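Definition 11 Let (T, d) be a metric space and $\varepsilon > 0$. A subset $S \subset T$ is called an $\varepsilon$-net
of T if for all $t \in T$ there exists an $s \in S$ such that $d(s,t) \le \varepsilon$. Furthermore, we define the
$\varepsilon$-covering number of T by

$$\mathcal{N}(T, d, \varepsilon) := \inf\Bigl\{n \ge 1 : \exists s_1, \ldots, s_n \in T \text{ such that } T \subset \bigcup_{i=1}^{n} B_d(s_i, \varepsilon)\Bigr\},$$

where $\inf \emptyset := \infty$ and $B_d(s, \varepsilon) := \{t \in T : d(t,s) \le \varepsilon\}$.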
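
For a finite point set, the notions of an $\varepsilon$-net and of the covering number can be explored with a simple greedy construction. The following Python sketch is an illustration only: the function name `greedy_eps_net` and the grid used as the set T are assumptions made for this example, and the size of the greedy net is merely an upper bound on $\mathcal{N}(T, d, \varepsilon)$, not its exact value.

```python
import math


def greedy_eps_net(points, eps):
    """Select centers from `points` such that every point is within eps of a center."""
    net = []
    for x in points:
        # x becomes a new center only if no existing center already covers it.
        if all(math.dist(x, s) > eps for s in net):
            net.append(x)
    return net


# Hypothetical example set T: an 11 x 11 grid in the unit square, Euclidean metric.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
net = greedy_eps_net(grid, 0.25)
print(len(net), "centers cover", len(grid), "points")
```

Every skipped point lies within eps of an earlier center, and every selected point covers itself, so the returned centers indeed form an $\varepsilon$-net of the grid.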


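Note that an upper bound on entropy numbers implies a bound on covering numbers.
To be more precise, for a metric space (T, d) and constants a > 0 and q > 0, the implication

$$e_i(T, d) \le a\, i^{-\frac{1}{q}}, \quad \forall\, i \ge 1 \qquad \Longrightarrow \qquad \ln \mathcal{N}(T, d, \varepsilon) \le \ln(4) \Bigl(\frac{a}{\varepsilon}\Bigr)^{q}, \quad \forall\, \varepsilon > 0 \tag{26}$$

holds by (Steinwart and Christmann, 2008a, Lemma 6.21). Additionally, (Steinwart and
Christmann, 2008a, Exercise 6.8) yields the opposite implication, namely

$$\ln \mathcal{N}(T, d, \varepsilon) < \Bigl(\frac{a}{\varepsilon}\Bigr)^{q}, \quad \forall\, \varepsilon > 0 \qquad \Longrightarrow \qquad e_i(T, d) \le 3^{\frac{1}{q}} a\, i^{-\frac{1}{q}}, \quad \forall\, i \ge 1. \tag{27}$$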
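
The interplay of the two implications can be traced numerically. The following Python sketch is a plain evaluation of the constants appearing in (26) and (27), with hypothetical function names; it does not compute entropy or covering numbers of any concrete space, and it ignores the distinction between strict and non-strict inequalities. It shows that passing from an entropy bound to a covering bound and back inflates the constant a by the factor $(3\ln 4)^{1/q}$.

```python
import math


def covering_bound(a, q, eps):
    """Right-hand side of (26): ln(4) * (a / eps)^q."""
    return math.log(4.0) * (a / eps) ** q


def entropy_bound(a, q, i):
    """Right-hand side of (27): 3^(1/q) * a * i^(-1/q)."""
    return 3.0 ** (1.0 / q) * a * i ** (-1.0 / q)


def round_trip_factor(q):
    """Factor picked up by a when (26) is followed by (27)."""
    return (3.0 * math.log(4.0)) ** (1.0 / q)


a, q, i = 2.0, 0.5, 3
# (26) can be rewritten as (a_prime / eps)^q with a_prime = ln(4)^(1/q) * a.
a_prime = math.log(4.0) ** (1.0 / q) * a
print(entropy_bound(a_prime, q, i), round_trip_factor(q) * a * i ** (-1.0 / q))
```

With q = 2p this is the origin of the factor $(3\ln(4))^{1/(2p)}$ in the entropy estimates that follow.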


Recall that our aim is to estimate $e_i(\mathrm{id} : H \to L_2(P_X))$. In fact, the equivalence
of entropy and covering numbers enables us to estimate the covering number
$\mathcal{N}(B_H, \|\cdot\|_{L_2(P_X)}, \varepsilon)$ instead.

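Lemma 12 Let $\nu$ be a distribution on X and $A, B \subset X$ with $A \cap B = \emptyset$. Moreover, let $H_A$
and $H_B$ be RKHSs on A and B that are embedded into $L_2(\nu|_A)$ and $L_2(\nu|_B)$, respectively.
Let the extended RKHSs $\hat H_A$ and $\hat H_B$ be defined as in Lemma 01 and denote their direct sum
by H as in (m, where the norm is given by (d) with $\lambda_A, \lambda_B > 0$. Then, for the $\varepsilon$-covering
number of H w.r.t. $\|\cdot\|_{L_2(\nu)}$, we have

$$\mathcal{N}\bigl(B_H, \|\cdot\|_{L_2(\nu)}, \varepsilon\bigr) \le \mathcal{N}\bigl(\lambda_A^{-\frac12} B_{\hat H_A}, \|\cdot\|_{L_2(\nu|_A)}, \varepsilon_A\bigr) \cdot \mathcal{N}\bigl(\lambda_B^{-\frac12} B_{\hat H_B}, \|\cdot\|_{L_2(\nu|_B)}, \varepsilon_B\bigr),$$

where $\varepsilon_A, \varepsilon_B > 0$ and $\varepsilon := \sqrt{\varepsilon_A^2 + \varepsilon_B^2}$.

Proof First of all, we assume that there exist $a, b \in \mathbb{N}$ and functions
$f_1, \ldots, f_a \in \lambda_A^{-\frac12} B_{\hat H_A}$ and $h_1, \ldots, h_b \in \lambda_B^{-\frac12} B_{\hat H_B}$ such that
$\{f_1, \ldots, f_a\}$ is an $\varepsilon_A$-cover of $\lambda_A^{-\frac12} B_{\hat H_A}$ w.r.t. $\|\cdot\|_{L_2(\nu|_A)}$,
$\{h_1, \ldots, h_b\}$ is an $\varepsilon_B$-cover of $\lambda_B^{-\frac12} B_{\hat H_B}$ w.r.t. $\|\cdot\|_{L_2(\nu|_B)}$,

$$a = \mathcal{N}\bigl(\lambda_A^{-\frac12} B_{\hat H_A}, \|\cdot\|_{L_2(\nu|_A)}, \varepsilon_A\bigr) \quad \text{and} \quad b = \mathcal{N}\bigl(\lambda_B^{-\frac12} B_{\hat H_B}, \|\cdot\|_{L_2(\nu|_B)}, \varepsilon_B\bigr).$$

That is, for every function $g_A \in \lambda_A^{-\frac12} B_{\hat H_A}$, there exists an $i_A \in \{1, \ldots, a\}$ such that

$$\|g_A - f_{i_A}\|_{L_2(\nu|_A)} \le \varepsilon_A, \tag{28}$$

and for every function $g_B \in \lambda_B^{-\frac12} B_{\hat H_B}$, there exists an $i_B \in \{1, \ldots, b\}$ such that

$$\|g_B - h_{i_B}\|_{L_2(\nu|_B)} \le \varepsilon_B. \tag{29}$$

Let us now consider an arbitrary function $g \in B_H$. Then there exist $g_A \in \lambda_A^{-\frac12} B_{\hat H_A}$
and $g_B \in \lambda_B^{-\frac12} B_{\hat H_B}$ such that $g = g_A + g_B$. Together with (28) and (29), this implies

$$\bigl\|g - (f_{i_A} + h_{i_B})\bigr\|_{L_2(\nu)}^2 = \bigl\|(g_A - f_{i_A}) + (g_B - h_{i_B})\bigr\|_{L_2(\nu)}^2$$

$$= \|g_A - f_{i_A}\|_{L_2(\nu|_A)}^2 + \|g_B - h_{i_B}\|_{L_2(\nu|_B)}^2$$

$$\le \varepsilon_A^2 + \varepsilon_B^2 =: \varepsilon^2.$$

With this, we know that

$$\bigl\{ f_{i_A} + h_{i_B} : f_{i_A} \in \{f_1, \ldots, f_a\} \text{ and } h_{i_B} \in \{h_1, \ldots, h_b\} \bigr\}$$

is an $\varepsilon$-net of $B_H$ w.r.t. $\|\cdot\|_{L_2(\nu)}$. Concerning the $\varepsilon$-covering number of H, this finally implies

$$\mathcal{N}\bigl(B_H, \|\cdot\|_{L_2(\nu)}, \varepsilon\bigr) \le a \cdot b = \mathcal{N}\bigl(\lambda_A^{-\frac12} B_{\hat H_A}, \|\cdot\|_{L_2(\nu|_A)}, \varepsilon_A\bigr) \cdot \mathcal{N}\bigl(\lambda_B^{-\frac12} B_{\hat H_B}, \|\cdot\|_{L_2(\nu|_B)}, \varepsilon_B\bigr).$$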
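
The product construction used in this proof can be checked on a toy example. In the following Python sketch, functions on two disjoint finite sets are represented by their value vectors, the measure is taken to be the counting measure, and all names (`sq_dist`, `product_net`, `covering_radius_sq`) as well as the concrete sets are hypothetical choices made for illustration. The sketch verifies that pairing the two nets yields a net whose squared radius is at most the sum of the two squared radii.

```python
from itertools import product


def sq_dist(u, v):
    """Squared Euclidean distance between two value vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def product_net(net_a, net_b):
    """All sums f + h; for disjoint A and B this is a concatenation of value vectors."""
    return [f + h for f, h in product(net_a, net_b)]


def covering_radius_sq(functions, net):
    """Largest squared distance from a function to its nearest net element."""
    return max(min(sq_dist(g, s) for s in net) for g in functions)


# Hypothetical toy setting: A consists of two points, B of one point.
funcs_a = list(product([0.0, 0.5, 1.0], repeat=2))
funcs_b = [(0.0,), (0.5,), (1.0,)]
net_a = list(product([0.25, 0.75], repeat=2))
net_b = [(0.25,), (0.75,)]

eps_a_sq = covering_radius_sq(funcs_a, net_a)
eps_b_sq = covering_radius_sq(funcs_b, net_b)
funcs = [g_a + g_b for g_a, g_b in product(funcs_a, funcs_b)]
net = product_net(net_a, net_b)
eps_sq = covering_radius_sq(funcs, net)
print(len(net), eps_sq, eps_a_sq + eps_b_sq)
```

The net on the union has `len(net_a) * len(net_b)` elements, which mirrors the bound a · b on the covering number.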


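Based on Lemma 12, the following theorem relates entropy numbers of $H_A$ and $H_B$ to
those of H.

Theorem 13 Let $P_X$ be a distribution on X and $A_1, \ldots, A_m \subset X$ be pairwise disjoint.
Moreover, we assume (H) with weights $\lambda_1, \ldots, \lambda_m > 0$. In addition, assume that there exist
constants $p \in (0,1)$ and $a_j > 0$, $j \in \{1, \ldots, m\}$, such that for every $j \in \{1, \ldots, m\}$

$$e_i\bigl(\mathrm{id} : H_j \to L_2(P_X|_{A_j})\bigr) \le a_j\, i^{-\frac{1}{2p}}, \qquad i \ge 1. \tag{30}$$

Then we have

$$e_i\bigl(\mathrm{id} : H \to L_2(P_X)\bigr) \le 2\sqrt{m} \Bigl(3\ln(4) \sum_{j=1}^{m} \lambda_j^{-p} a_j^{2p}\Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}}, \qquad i \ge 1,$$

and, for the average entropy numbers,

$$\mathbb{E}_{D_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H \to L_2(D_X)\bigr) \le c_p \sqrt{m} \Bigl(\sum_{j=1}^{m} \lambda_j^{-p} a_j^{2p}\Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}}, \qquad i, n \ge 1.$$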


Proof [Proof of Theorem [T3] First of all, note that the restriction operator X : —)■ Bhj 

with Xf = / is an isometric isomorphism. Together with ([Steinwart and Christmannl . l2008al . 
(A.36)) and assumption (l30l) . this yields 

e*(A;^i?^^,,L2(Px|A,)) = 2A7^e,(B^^.,L2(Px|A,)) 

< 2\- 2 jjX : Bjj -)■ BHj\\ei{BH,j,L2{Tx\Aj)) 


< 2A- 2p 

J 


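Furthermore, we know by (26) that

\[
\ln \mathcal{N}\bigl(\lambda_j^{-1/2} B_{\hat H_j}, \|\cdot\|_{L_2(P_X|_{A_j})}, \varepsilon\bigr) \le \ln(4) \bigl( 2 \lambda_j^{-1/2} a_j \bigr)^{2p} \varepsilon^{-2p}
\]

holds for all $\varepsilon > 0$. With this and $\varepsilon_j := \varepsilon / \sqrt{m}$ for every $j \in \{1, \dots, m\}$, Lemma 12 implies

\[
\begin{aligned}
\ln \mathcal{N}\bigl(B_H, \|\cdot\|_{L_2(P_X)}, \varepsilon\bigr)
&\le \ln \Bigl( \prod_{j=1}^m \mathcal{N}\bigl(\lambda_j^{-1/2} B_{\hat H_j}, \|\cdot\|_{L_2(P_X|_{A_j})}, \varepsilon_j\bigr) \Bigr) \\
&= \sum_{j=1}^m \ln \mathcal{N}\Bigl(\lambda_j^{-1/2} B_{\hat H_j}, \|\cdot\|_{L_2(P_X|_{A_j})}, \frac{\varepsilon}{\sqrt{m}}\Bigr) \\
&\le \sum_{j=1}^m \ln(4) \bigl( 2 \lambda_j^{-1/2} a_j \bigr)^{2p} \Bigl( \frac{\varepsilon}{\sqrt{m}} \Bigr)^{-2p} \\
&= \Bigl( 2 \ln(4)^{\frac{1}{2p}} \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} \Bigr)^{2p} \varepsilon^{-2p} .
\end{aligned}
\]

Using (27), the latter bound for the covering number of $B_H$ finally implies the following entropy estimate

\[
\begin{aligned}
e_i\bigl(\mathrm{id} : H \to L_2(P_X)\bigr)
&\le 3^{\frac{1}{2p}} \cdot 2 \ln(4)^{\frac{1}{2p}} \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}} \\
&\le 2 \bigl( 3 \ln(4) \bigr)^{\frac{1}{2p}} \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}} .
\end{aligned}
\]

The second assertion immediately follows by (Steinwart and Christmann, 2008a, Corollary 7.31). ■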
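The algebra behind the covering number bound in the proof of Theorem 13 is easy to check numerically. The sketch below is only an illustration under assumed inputs: the function names and the sample values of $\lambda_j$, $a_j$, $p$ and $\varepsilon$ are hypothetical. It compares the sum over the cells, evaluated at $\varepsilon_j = \varepsilon/\sqrt{m}$, with the closed form, and evaluates the resulting entropy bound.

```python
import math

def sum_of_cell_bounds(lambdas, a, p, eps):
    """sum_j ln(4) * (2 * lambda_j^{-1/2} * a_j)^{2p} * (eps / sqrt(m))^{-2p}."""
    m = len(lambdas)
    eps_j = eps / math.sqrt(m)
    return sum(
        math.log(4) * (2 * lam ** -0.5 * a_j) ** (2 * p) * eps_j ** (-2 * p)
        for lam, a_j in zip(lambdas, a)
    )

def closed_form(lambdas, a, p, eps):
    """(2 ln(4)^{1/(2p)} sqrt(m) (sum_j lambda_j^{-p} a_j^{2p})^{1/(2p)})^{2p} * eps^{-2p}."""
    m = len(lambdas)
    s = sum(lam ** -p * a_j ** (2 * p) for lam, a_j in zip(lambdas, a))
    base = 2 * math.log(4) ** (1 / (2 * p)) * math.sqrt(m) * s ** (1 / (2 * p))
    return base ** (2 * p) * eps ** (-2 * p)

def entropy_bound(i, lambdas, a, p):
    """Right-hand side of the first bound of Theorem 13."""
    m = len(lambdas)
    s = sum(lam ** -p * a_j ** (2 * p) for lam, a_j in zip(lambdas, a))
    return 2 * math.sqrt(m) * (3 * math.log(4) * s) ** (1 / (2 * p)) * i ** (-1 / (2 * p))

# Hypothetical sample values, chosen only for illustration.
lambdas = [0.5, 0.1, 2.0]
a = [1.0, 0.3, 0.7]
p, eps = 0.25, 0.05
lhs = sum_of_cell_bounds(lambdas, a, p, eps)
rhs = closed_form(lambdas, a, p, eps)
```

Both expressions agree up to rounding, and `entropy_bound` decays like $i^{-1/(2p)}$ in its first argument.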


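Applying Theorem 13, we now prove Theorem 5 and thus an oracle inequality for VP-SVMs with generic losses.

Proof [Proof of Theorem 5] Since $H_1, \dots, H_m$ are separable RKHSs of measurable kernels $k_1, \dots, k_m$, the joined space $H$ is a separable RKHS and its kernel $k$ is measurable, too. Furthermore, Theorem 13 yields

\[
\mathbb{E}_{D_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H \to L_2(D_X)\bigr) \le c_p \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}} , \qquad i, n \ge 1.
\]

That is, we can apply (Steinwart and Christmann, 2008a, Theorem 7.23) for a regularization parameter $\lambda = 1$ and, for all fixed $\tau > 0$ and $\lambda_j > 0$, $j \in \{1, \dots, m\}$, we obtain

\[
\begin{aligned}
&\sum_{j=1}^m \lambda_j \| f_{D_j, \lambda_j} \|_{\hat H_j}^2 + \mathcal{R}_{L_J, P}(\wideparen{f}_{D, \lambda}) - \mathcal{R}^*_{L_J, P} \\
&\quad = \| f_{D, \lambda} \|_H^2 + \mathcal{R}_{L_J, P}(\wideparen{f}_{D, \lambda}) - \mathcal{R}^*_{L_J, P} \\
&\quad \le 9 \bigl( \| f_0 \|_H^2 + \mathcal{R}_{L_J, P}(f_0) - \mathcal{R}^*_{L_J, P} \bigr) + C \Bigl( \frac{a^{2p}}{n} \Bigr)^{\frac{1}{2 - p - \vartheta + \vartheta p}} + 3 \Bigl( \frac{72 V \tau}{n} \Bigr)^{\frac{1}{2 - \vartheta}} + \frac{15 B_0 \tau}{n}
\end{aligned}
\]

with probability $P^n$ not less than $1 - 3 e^{-\tau}$, where $C > 0$ is the constant of (Steinwart and Christmann, 2008a, Theorem 7.23) only depending on $p$, $M$, $V$, $\vartheta$, and $B$. Moreover,

\[
a := \max \Bigl\{ c_p \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} , B \Bigr\} ,
\]

where we need $a \ge B$ since it is a condition of (Steinwart and Christmann, 2008a, Theorem 7.23). ■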


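7.2 Proofs Related to the Entropy Estimates of Section 4

In this subsection, just as in Section 4, we focus on Gaussian RBF kernels and the associated RKHSs. To be more precise, we derive a bound for the entropy numbers of $H_\gamma(A)$, where $\gamma > 0$ and $A \subset \mathbb{R}^d$ with $\mathring{A} \ne \emptyset$.

Proof [Proof of Theorem 6] First of all, we consider the commutative diagram

\[
\begin{array}{ccc}
H_\gamma(A) & \xrightarrow{\ \mathrm{id}\ } & L_2(P_X|_A) \\
\big\downarrow {\scriptstyle \mathcal{I}_B^{-1} \circ \mathcal{I}_A} & & \big\uparrow {\scriptstyle \mathrm{id}} \\
H_\gamma(B) & \xrightarrow{\ \mathrm{id}\ } & \ell_\infty(B)
\end{array}
\]

where the extension operator $\mathcal{I}_A : H_\gamma(A) \to H_\gamma(\mathbb{R}^d)$ and the restriction operator $\mathcal{I}_B^{-1} : H_\gamma(\mathbb{R}^d) \to H_\gamma(B)$ given by (Steinwart and Christmann, 2008a, Corollary 4.43) are isometric isomorphisms, so that $\| \mathcal{I}_B^{-1} \circ \mathcal{I}_A : H_\gamma(A) \to H_\gamma(B) \| = 1$. Furthermore, for $f \in \ell_\infty(B)$, we have

\[
\| f \|_{L_2(P_X|_A)} = \Bigl( \int_X 1_A(x) |f(x)|^2 \, dP_X(x) \Bigr)^{\frac12}
\le \| f \|_\infty \Bigl( \int_X 1_A(x) \, dP_X(x) \Bigr)^{\frac12} = \| f \|_\infty \sqrt{P_X(A)} ,
\]

i.e. $\| \mathrm{id} : \ell_\infty(B) \to L_2(P_X|_A) \| \le \sqrt{P_X(A)}$. Together with (Steinwart and Christmann, 2008a, (A.38) and (A.39)) as well as (Steinwart and Christmann, 2008a, Theorem 6.27), we obtain for all $i \ge 1$

\[
\begin{aligned}
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_X|_A)\bigr)
&\le \| \mathcal{I}_B^{-1} \circ \mathcal{I}_A : H_\gamma(A) \to H_\gamma(B) \| \cdot e_i\bigl(\mathrm{id} : H_\gamma(B) \to \ell_\infty(B)\bigr) \cdot \| \mathrm{id} : \ell_\infty(B) \to L_2(P_X|_A) \| \\
&\le \sqrt{P_X(A)}\, c_{m,d}\, r^m \gamma^{-m} i^{-\frac{m}{d}} ,
\end{aligned}
\]

where $m \ge 1$ is an arbitrary integer and $c_{m,d}$ a positive constant. For $p \in (0,1)$, the choice $m = \lceil \frac{d}{2p} \rceil$ finally yields

\[
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_X|_A)\bigr) \le \sqrt{P_X(A)}\, c_{m,d}\, r^m \gamma^{-m} i^{-\frac{m}{d}} \le c_p \sqrt{P_X(A)}\, r^{\frac{d+2p}{2p}} \gamma^{-\frac{d+2p}{2p}} i^{-\frac{1}{2p}} .
\]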
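The last step of the proof of Theorem 6 rests on two elementary exponent comparisons. The following sketch checks them with exact rational arithmetic, assuming the choice $m = \lceil d/(2p) \rceil$ and $\gamma \le r$; both the function names and these two assumptions are ours and are not taken verbatim from the source.

```python
import math
from fractions import Fraction

def polynomial_order(d, p):
    """Smallest integer m with m >= d / (2p); p is expected to be a Fraction."""
    return math.ceil(Fraction(d) / (2 * p))

def exponents_ok(d, p):
    m = polynomial_order(d, p)
    # i^{-m/d} <= i^{-1/(2p)} for i >= 1 requires m/d >= 1/(2p).
    decay_ok = Fraction(m, d) >= 1 / (2 * p)
    # (r/gamma)^m <= (r/gamma)^{(d+2p)/(2p)} for gamma <= r requires m <= (d+2p)/(2p).
    ratio_ok = m <= (d + 2 * p) / (2 * p)
    return decay_ok and ratio_ok

# Check a grid of dimensions d and exponents p in (0, 1).
all_ok = all(
    exponents_ok(d, Fraction(k, 10)) for d in range(1, 6) for k in range(1, 10)
)
```

For example, $d = 3$ and $p = 1/2$ give $m = 3$, so that $m/d = 1 = 1/(2p)$ and $m \le (d+2p)/(2p) = 4$.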


7.3 Proofs Related to the Least Squares VP-SVMs

In this subsection, we prove the results that are linked with the least squares loss, i.e. the results of Section 5. Before we elaborate on the oracle inequality for VP-SVMs using the least squares loss as well as RKHSs of Gaussian kernels, we have to examine the excess risk

\[
\mathcal{R}_{L_{J_T}, P}(f_0) - \mathcal{R}^*_{L_{J_T}, P} = \| f_0 - f^*_{L,P} \|_{L_2(P_X|_{A_T})}^2 . \tag{31}
\]

Let us begin by writing, for fixed $\gamma_j > 0$,

\[
K_j(x) := \sum_{\ell=1}^{s} \binom{s}{\ell} (-1)^{1-\ell} \frac{1}{\ell^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|x\|_2^2}{\ell^2 \gamma_j^2} \Bigr) , \qquad x \in \mathbb{R}^d, \tag{32}
\]

and choosing $f_0 := \sum_{j \in J_T} 1_{A_j} \cdot (K_j * f^*_{L,P})$. Then (31) can be estimated with the help of the following theorem, which is, together with its proof, basically a modification of (Eberts and Steinwart, 2013, Theorem 2.2). Indeed, the proofs proceed mainly identically. Note that we use the notation

\[
\gamma_{\max} := \max\{\gamma_1, \dots, \gamma_m\} \quad \text{and} \quad \gamma_{\min} := \min\{\gamma_1, \dots, \gamma_m\}
\]

in the following theorem and the associated proof.

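Theorem 14 Let us fix some $q \in [1, \infty)$. Assume that $\nu$ is a finite measure on $\mathbb{R}^d$ with $\operatorname{supp} \nu =: X \subset \mathbb{R}^d$ and let $(A_j)_{j=1,\dots,m}$ be a partition of $X$. Furthermore, let $f : \mathbb{R}^d \to \mathbb{R}$ be such that $f \in B^\alpha_{q,\infty}(\nu)$ for some $\alpha \ge 1$. For the functions $K_j : \mathbb{R}^d \to \mathbb{R}$, $j \in \{1, \dots, m\}$, defined by (32), where $s := \lfloor \alpha \rfloor + 1$ and $\gamma_1, \dots, \gamma_m > 0$, we then have

\[
\Bigl\| \sum_{j=1}^m 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q \le C_{\alpha,q} \Bigl( \frac{\gamma_{\max}}{\gamma_{\min}} \Bigr)^d \gamma_{\max}^{q\alpha} ,
\]

where $C_{\alpha,q} := \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \bigl( \frac d2 \bigr)^{\frac{q\alpha}{2}} \pi^{-\frac14}\, \Gamma\bigl( q\alpha + \frac12 \bigr)^{\frac12}$.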

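Proof In the following, we write $J := \{1, \dots, m\}$. To show

\[
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q
\le \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac d2 \Bigr)^{\frac{q\alpha}{2}} \pi^{-\frac14}\, \Gamma\Bigl( q\alpha + \frac12 \Bigr)^{\frac12} \Bigl( \frac{\gamma_{\max}}{\gamma_{\min}} \Bigr)^d \gamma_{\max}^{q\alpha} ,
\]

we have to proceed in a similar way as in the proof of (Eberts and Steinwart, 2013, Theorem 2.2). First of all, we use the translation invariance of the Lebesgue measure and $\exp(-\|u\|_2^2) = \exp(-\|-u\|_2^2)$ ($u \in \mathbb{R}^d$) to obtain, for $x \in X$ and $j \in J$,

\[
\begin{aligned}
K_j * f(x)
&= \int_{\mathbb{R}^d} \sum_{\ell=1}^{s} \binom{s}{\ell} (-1)^{1-\ell} \frac{1}{\ell^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|x - t\|_2^2}{\ell^2 \gamma_j^2} \Bigr) f(t) \, dt \\
&= \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) \Bigl( \sum_{\ell=1}^{s} \binom{s}{\ell} (-1)^{1-\ell} f(x + \ell h) \Bigr) dh .
\end{aligned}
\]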


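With this we can derive, for $q \ge 1$,

\[
\begin{aligned}
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q
&= \int_{\mathbb{R}^d} \Bigl| \sum_{j \in J} 1_{A_j}(x)\, (K_j * f)(x) - f(x) \Bigr|^q d\nu(x) \\
&= \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) \bigl| K_j * f(x) - f(x) \bigr|^q d\nu(x) \\
&= \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) \Bigl| \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) \Delta_h^s(f, x) \, dh \Bigr|^q d\nu(x) .
\end{aligned}
\]

Then Hölder's inequality and $\int_{\mathbb{R}^d} \bigl( \frac{2}{\gamma_j^2 \pi} \bigr)^{\frac d2} \exp\bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \bigr) dh = 1$ yield, for $q > 1$,

\[
\begin{aligned}
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q
&\le \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) \Bigl( \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) dh \Bigr)^{q-1} \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) |\Delta_h^s(f, x)|^q \, dh \, d\nu(x) \\
&= \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) |\Delta_h^s(f, x)|^q \, dh \, d\nu(x) \\
&= \int_{\mathbb{R}^d} \sum_{j \in J} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) \int_{\mathbb{R}^d} 1_{A_j}(x) |\Delta_h^s(f, x)|^q \, d\nu(x) \, dh \\
&\le \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) |\Delta_h^s(f, x)|^q \, d\nu(x) \, dh \\
&= \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \| \Delta_h^s(f, \cdot) \|_{L_q(\nu)}^q \, dh \\
&\le \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \omega_{s, L_q(\nu)}^q(f, \|h\|_2) \, dh .
\end{aligned}
\]

Moreover, for $q = 1$, we have

\[
\begin{aligned}
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_1(\nu)}
&\le \int_{\mathbb{R}^d} \sum_{j \in J} 1_{A_j}(x) \int_{\mathbb{R}^d} \Bigl( \frac{2}{\gamma_j^2 \pi} \Bigr)^{\frac d2} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_j^2} \Bigr) |\Delta_h^s(f, x)| \, dh \, d\nu(x) \\
&\le \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \| \Delta_h^s(f, \cdot) \|_{L_1(\nu)} \, dh \\
&\le \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \omega_{s, L_1(\nu)}(f, \|h\|_2) \, dh .
\end{aligned}
\]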

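Consequently, we can proceed in the same way for all $q \ge 1$. To this end, note that the assumption $f \in B^\alpha_{q,\infty}(\nu)$ implies $\omega_{s, L_q(\nu)}(f, t) \le \| f \|_{B^\alpha_{q,\infty}(\nu)}\, t^\alpha$ for $t > 0$. The latter together with Hölder's inequality yields

\[
\begin{aligned}
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q
&\le \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) \omega_{s, L_q(\nu)}^q(f, \|h\|_2) \, dh \\
&\le \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \int_{\mathbb{R}^d} \|h\|_2^{q\alpha} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) dh \\
&\le \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \Bigl( \int_{\mathbb{R}^d} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) dh \Bigr)^{\frac12} \Bigl( \int_{\mathbb{R}^d} \|h\|_2^{2q\alpha} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) dh \Bigr)^{\frac12} .
\end{aligned}
\]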

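Using the embedding constant $d^{\frac{q\alpha - 1}{2q\alpha}}$ of $\ell_{2q\alpha}^d$ to $\ell_2^d$, we obtain

\[
\int_{\mathbb{R}^d} \|h\|_2^{2q\alpha} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma^2} \Bigr) dh
\le d^{q\alpha - 1} \sum_{i=1}^d \int_{\mathbb{R}^d} |h_i|^{2q\alpha} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma^2} \Bigr) dh
= 2 d^{q\alpha} \Bigl( \frac{\gamma^2 \pi}{2} \Bigr)^{\frac{d-1}{2}} \int_0^\infty t^{2q\alpha} \exp\Bigl( -\frac{2 t^2}{\gamma^2} \Bigr) dt
\]

for $\gamma > 0$. With the substitution $t = (\gamma_{\max}^2 u / 2)^{\frac12}$, the functional equation $\Gamma(t + 1) = t\, \Gamma(t)$ of the Gamma function $\Gamma$, and $\gamma = \gamma_{\max}$, we further have

\[
\int_0^\infty t^{2q\alpha} \exp\Bigl( -\frac{2 t^2}{\gamma_{\max}^2} \Bigr) dt
= \frac12 \Bigl( \frac{\gamma_{\max}}{\sqrt 2} \Bigr)^{2q\alpha + 1} \int_0^\infty u^{(q\alpha + \frac12) - 1} \exp(-u) \, du
= \frac12 \Bigl( \frac{\gamma_{\max}}{\sqrt 2} \Bigr)^{2q\alpha + 1} \Gamma\Bigl( q\alpha + \frac12 \Bigr) .
\]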

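Altogether, we finally obtain

\[
\begin{aligned}
\Bigl\| \sum_{j \in J} 1_{A_j} \cdot (K_j * f) - f \Bigr\|_{L_q(\nu)}^q
&\le \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \Bigl( \frac{\gamma_{\max}^2 \pi}{2} \Bigr)^{\frac d4} \Bigl( \int_{\mathbb{R}^d} \|h\|_2^{2q\alpha} \exp\Bigl( -\frac{2 \|h\|_2^2}{\gamma_{\max}^2} \Bigr) dh \Bigr)^{\frac12} \\
&\le \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac{2}{\gamma_{\min}^2 \pi} \Bigr)^{\frac d2} \Bigl( \frac{\gamma_{\max}^2 \pi}{2} \Bigr)^{\frac d4} \Bigl( d^{q\alpha} \Bigl( \frac{\gamma_{\max}^2 \pi}{2} \Bigr)^{\frac{d-1}{2}} \Bigl( \frac{\gamma_{\max}}{\sqrt 2} \Bigr)^{2q\alpha + 1} \Gamma\Bigl( q\alpha + \frac12 \Bigr) \Bigr)^{\frac12} \\
&= \| f \|_{B^\alpha_{q,\infty}(\nu)}^q \Bigl( \frac d2 \Bigr)^{\frac{q\alpha}{2}} \pi^{-\frac14}\, \Gamma\Bigl( q\alpha + \frac12 \Bigr)^{\frac12} \Bigl( \frac{\gamma_{\max}}{\gamma_{\min}} \Bigr)^d \gamma_{\max}^{q\alpha} .
\end{aligned}
\]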
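Based on Theorems 5 and 14, we can now show Theorem 7.

Proof [Proof of Theorem 7] First, we have to choose a function $f_0 \in H$. To this end, we define functions $K_j : \mathbb{R}^d \to \mathbb{R}$, $j \in \{1, \dots, m\}$, by (32), where $s := \lfloor \alpha \rfloor + 1$ and $\gamma_j > 0$. Then we define $f_0$ by convolving each $K_j$ with the Bayes decision function $f^*_{L,P}$, that is

\[
f_0(x) := \sum_{j \in J_T} 1_{A_j}(x) \cdot (K_j * f^*_{L,P})(x) , \qquad x \in \mathbb{R}^d .
\]

Now, to show that $f_0$ is indeed a suitable function to bound the approximation error, we first need to ensure that $f_0$ is contained in $H$. In addition, we need to derive bounds for both the regularization term and the excess risk of $f_0$. To this end, we apply (Eberts and Steinwart, 2013, Theorem 2.3) and obtain, for every $j \in J_T$,

\[
(K_j * f^*_{L,P})|_{A_j} \in H_{\gamma_j}(A_j)
\]

with

\[
\bigl\| (K_j * f^*_{L,P})|_{A_j} \bigr\|_{H_{\gamma_j}(A_j)} \le (\gamma_j \sqrt{\pi})^{-\frac d2} (2^s - 1) \| f^*_{L,P} \|_{L_2(\mathbb{R}^d)} .
\]

This implies

\[
f_0 = \sum_{j \in J_T} 1_{A_j} (K_j * f^*_{L,P}) .
\]

Besides, note that $0 \in H_{\gamma_j}(A_j)$ for every $j \in \{1, \dots, m\}$ such that $f_0$ can be written as $f_0 = \sum_{j=1}^m f_j$, where

\[
f_j := \begin{cases} 1_{A_j} (K_j * f^*_{L,P}) , & j \in J_T , \\ 0 , & j \notin J_T . \end{cases}
\]

Obviously, the latter implies $f_0 \in H$. Furthermore, for $A_T := \bigcup_{j \in J_T} A_j$, (31) and Theorem 14 yield

\[
\begin{aligned}
\mathcal{R}_{L_{J_T}, P}(f_0) - \mathcal{R}^*_{L_{J_T}, P}
&= \| f_0 - f^*_{L,P} \|_{L_2(P_X|_{A_T})}^2 \\
&= \Bigl\| \sum_{j \in J_T} 1_{A_j} (K_j * f^*_{L,P}) - f^*_{L,P} \Bigr\|_{L_2(P_X|_{A_T})}^2 \\
&\le C_{\alpha,2} \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} ,
\end{aligned}
\]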

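where $C_{\alpha,2}$ is a constant only depending on $\alpha$, $d$, and $\| f^*_{L,P} \|_{B^\alpha_{2,\infty}(P_X|_{A_T})}$. To utilize Theorem 5, it remains to examine the constants $B$, $V$, $\vartheta$, and $B_0$. Since we consider the least squares loss, which can be clipped at $M$ with $Y = [-M, M]$, the supremum bound (18) holds for $B = 4M^2$ and the variance bound (19) for $V = 16M^2$ and $\vartheta = 1$ (cf. Steinwart and Christmann, 2008a, Example 7.3). Next, we derive a bound for $\| L_{J_T} \circ f_0 \|_\infty$ using (Eberts and Steinwart, 2013, Theorem 2.3), which provides, for every $x \in X$, the supremum bound

\[
|f_0(x)| = \Bigl| \sum_{j \in J_T} 1_{A_j}(x) \cdot (K_j * f^*_{L,P})(x) \Bigr|
\le \sum_{j \in J_T} 1_{A_j}(x) \bigl| K_j * f^*_{L,P}(x) \bigr| \le (2^s - 1) \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)} . \tag{33}
\]

The latter implies

\[
\begin{aligned}
\| L_{J_T} \circ f_0 \|_\infty
&= \sup_{(x,y) \in X \times Y} |L(y, f_0(x))| \\
&\le \sup_{(x,y) \in X \times Y} \bigl( M^2 + 2M |f_0(x)| + |f_0(x)|^2 \bigr) \\
&\le 4^s \max\bigl\{ M^2, \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)}^2 \bigr\} ,
\end{aligned}
\]

i.e. $B_0 := 4^s \max\{ M^2, \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)}^2 \}$. Moreover, since Theorem 6 provides

\[
e_i\bigl(\mathrm{id} : H_{\gamma_j}(A_j) \to L_2(P_X|_{A_j})\bigr) \le a_j\, i^{-\frac{1}{2p}} \quad \text{for } i \ge 1 \text{ with } a_j = \tilde c_p \sqrt{P_X(A_j)}\, r^{\frac{d+2p}{2p}} \gamma_j^{-\frac{d+2p}{2p}} ,
\]

where $\tilde c_p$ denotes the constant of Theorem 6, we have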


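\[
\begin{aligned}
a^{2p}
&= \Bigl( \max\Bigl\{ c_p \sqrt{m} \Bigl( \sum_{j=1}^m \lambda_j^{-p} a_j^{2p} \Bigr)^{\frac{1}{2p}} , B \Bigr\} \Bigr)^{2p} \\
&= \max\Bigl\{ c_p^{2p} \tilde c_p^{2p}\, m^p r^{d+2p} \sum_{j=1}^m \lambda_j^{-p} \gamma_j^{-(d+2p)} P_X(A_j)^p , B^{2p} \Bigr\} \\
&\le \max\Bigl\{ c_p^{2p} \tilde c_p^{2p}\, m\, r^{d+2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p , B^{2p} \Bigr\} \\
&\le \max\Bigl\{ c_p^{2p} \tilde c_p^{2p}\, 8^d r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p , B^{2p} \Bigr\} \\
&\le C_p\, r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p + B^{2p} =: \tilde a^{2p} ,
\end{aligned}
\]

where $c_p$ is the constant of Theorem 13 and $\tilde c_p$ the one of Theorem 6, and where we used the concavity of the function $t \mapsto t^p$ for $t \ge 0$, $m r^d \le 8^d$, and $C_p := c_p^{2p} \tilde c_p^{2p} 8^d$. Finally, applying Theorem 5 yields

\[
\begin{aligned}
&\sum_{j=1}^m \lambda_j \| f_{D_j, \lambda_j, \gamma_j} \|_{\hat H_{\gamma_j}(A_j)}^2 + \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \\
&\quad \le 9 \Bigl( \sum_{j=1}^m \lambda_j \| 1_{A_j} f_0 \|_{\hat H_{\gamma_j}(A_j)}^2 + \mathcal{R}_{L_{J_T}, P}(f_0) - \mathcal{R}^*_{L_{J_T}, P} \Bigr) + C\, \tilde a^{2p} n^{-1} + 3 \cdot \frac{72 V \tau}{n} + \frac{15 B_0 \tau}{n} \\
&\quad \le 9 \Bigl( \sum_{j \in J_T} \lambda_j (\gamma_j \sqrt{\pi})^{-d} (2^s - 1)^2 \| f^*_{L,P} \|_{L_2(\mathbb{R}^d)}^2 + C_{\alpha,2} \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} \Bigr) \\
&\qquad + C\, C_p\, r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + C\, B^{2p} n^{-1} + \frac{3456 M^2 \tau}{n} + 15 \cdot 4^s \max\bigl\{ M^2, \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)}^2 \bigr\} \frac{\tau}{n} \\
&\quad = 9 (2^s - 1)^2 \pi^{-\frac d2} \| f^*_{L,P} \|_{L_2(\mathbb{R}^d)}^2 \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + 9\, C_{\alpha,2} \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} \\
&\qquad + C\, C_p\, r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + 16^p C M^{4p} n^{-1} + \bigl( 3456 M^2 + 15 \cdot 4^s \max\bigl\{ M^2, \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)}^2 \bigr\} \bigr) \frac{\tau}{n}
\end{aligned}
\]

with probability $P^n$ not less than $1 - 3 e^{-\tau}$. Finally, for $\hat\tau \ge 1$, a variable transformation implies

\[
\begin{aligned}
&\sum_{j=1}^m \lambda_j \| f_{D_j, \lambda_j, \gamma_j} \|_{\hat H_{\gamma_j}(A_j)}^2 + \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \\
&\quad \le C_{M,\alpha,p} \Bigl( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} + r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + \hat\tau n^{-1} \Bigr)
\end{aligned}
\]

with probability $P^n$ not less than $1 - e^{-\hat\tau}$, where the constant $C_{M,\alpha,p}$ is defined by

\[
\begin{aligned}
C_{M,\alpha,p} := \max\Bigl\{ & 9 (2^s - 1)^2 \pi^{-\frac d2} \| f^*_{L,P} \|_{L_2(\mathbb{R}^d)}^2 ,\ 9 \| f^*_{L,P} \|_{B^\alpha_{2,\infty}(P_X|_{A_T})}^2 \Bigl( \frac d2 \Bigr)^{\alpha} \pi^{-\frac14}\, \Gamma\Bigl( 2\alpha + \frac12 \Bigr)^{\frac12} ,\ 8^d C c_p^{2p} \tilde c_p^{2p} , \\
& 16^p C M^{4p} + \bigl( 3456 M^2 + 15 \cdot 4^s \max\bigl\{ M^2, \| f^*_{L,P} \|_{L_\infty(\mathbb{R}^d)}^2 \bigr\} \bigr) (1 + \ln(3)) \Bigr\} .
\end{aligned}
\]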
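The numerical constants that enter the proof of Theorem 7 for the least squares loss can be verified with a few lines of arithmetic. The sketch below is a sanity check only, with function names of our own choosing: it uses $B = 4M^2$, $V = 16M^2$, $\vartheta = 1$, the bound $|f_0| \le (2^s - 1) \|f^*_{L,P}\|_\infty$ from (33), and compares $(M + |f_0|)^2$ with $B_0 = 4^s \max\{M^2, \|f^*_{L,P}\|_\infty^2\}$.

```python
def least_squares_constants(M):
    """Return B, V and the coefficient of tau/n in 3 * (72 V tau / n) for theta = 1."""
    B = 4 * M ** 2
    V = 16 * M ** 2
    variance_coefficient = 3 * 72 * V
    return B, V, variance_coefficient

def sup_loss_bound(M, f_inf, s):
    """(M + (2^s - 1) f_inf)^2, which dominates M^2 + 2M|f_0(x)| + |f_0(x)|^2 by (33)."""
    return (M + (2 ** s - 1) * f_inf) ** 2

def b0(M, f_inf, s):
    """B_0 = 4^s * max{M^2, f_inf^2}."""
    return 4 ** s * max(M ** 2, f_inf ** 2)

# Compare both bounds on a grid of hypothetical values of M, ||f*||_inf and s.
grid = [0.1, 0.5, 1.0, 2.0, 10.0]
bound_holds = all(
    sup_loss_bound(M, f_inf, s) <= b0(M, f_inf, s) * (1 + 1e-12)
    for M in grid for f_inf in grid for s in range(1, 6)
)
```

The comparison holds because $M + (2^s - 1) \|f^*_{L,P}\|_\infty \le 2^s \max\{M, \|f^*_{L,P}\|_\infty\}$, and for $M = 1$ the variance term contributes the coefficient $3 \cdot 72 \cdot 16 = 3456$.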


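Next, using the just proven oracle inequality presented in Theorem 7, we show the learning rates of Theorem 8 in only a few steps.

Proof [Proof of Theorem 8] First of all, we define sequences $\tilde\lambda_n := c_2 n^{-1}$ and $\tilde\gamma_n := c_3 n^{-\frac{1}{2\alpha + d}}$ to simplify the presentation. Then Theorem 7, $\sum_{j=1}^{m_n} P_X(A_j) = 1$, and $|J_T| \le m_n \le 8^d r_n^{-d}$ together with $\lambda_{n,j} = r_n^d \tilde\lambda_n$ and $\gamma_{n,j} = \tilde\gamma_n$ for all $j \in \{1, \dots, m_n\}$ yield

\[
\begin{aligned}
\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D, \lambda_n, \gamma_n}) - \mathcal{R}^*_{L_{J_T}, P}
&\le C_{M,\alpha,p} \Bigl( \sum_{j \in J_T} r_n^d \tilde\lambda_n \tilde\gamma_n^{-d} + \tilde\gamma_n^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} r_n^{-d} \tilde\lambda_n^{-1} \tilde\gamma_n^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + \tau n^{-1} \Bigr) \\
&\le 8^d C_{M,\alpha,p} \Bigl( \tilde\lambda_n \tilde\gamma_n^{-d} + \tilde\gamma_n^{2\alpha} + \Bigl( \frac{r_n^{2-d}}{\tilde\lambda_n} \Bigr)^p \tilde\gamma_n^{-(d+2p)} n^{-1} + \tau n^{-1} \Bigr) .
\end{aligned}
\]

Using the choices $\tilde\lambda_n = c_2 n^{-1}$, $\tilde\gamma_n = c_3 n^{-\frac{1}{2\alpha + d}}$ as well as $r_n = c_1 n^{-\frac{1}{\beta d}}$ finally implies

\[
\begin{aligned}
\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D, \lambda_n, \gamma_n}) - \mathcal{R}^*_{L_{J_T}, P}
&\le 8^d C_{M,\alpha,p} \Bigl( \tilde\lambda_n \tilde\gamma_n^{-d} + \tilde\gamma_n^{2\alpha} + \Bigl( \frac{r_n^{2-d}}{\tilde\lambda_n} \Bigr)^p \tilde\gamma_n^{-(d+2p)} n^{-1} + \tau n^{-1} \Bigr) \\
&\le \tilde C_{M,\alpha,p} \Bigl( n^{-1} n^{\frac{d}{2\alpha + d}} + n^{-\frac{2\alpha}{2\alpha + d}} + n^{-\frac{(2-d)p}{\beta d}} n^{p} n^{\frac{d + 2p}{2\alpha + d}} n^{-1} + \tau n^{-1} \Bigr) \\
&= \tilde C_{M,\alpha,p} \Bigl( n^{-\frac{2\alpha}{2\alpha + d}} + n^{-\frac{2\alpha}{2\alpha + d}} + n^{-\frac{2\alpha}{2\alpha + d} + \bigl( 1 + \frac{2}{2\alpha + d} + \frac{1}{\beta} - \frac{2}{\beta d} \bigr) p} + \tau n^{-1} \Bigr) \\
&\le C \tau\, n^{-\frac{2\alpha}{2\alpha + d} + \xi}
\end{aligned}
\]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $C > 0$ is a constant and

\[
\xi \ge \Bigl( 1 + \frac{2}{2\alpha + d} + \frac{1}{\beta} - \frac{2}{\beta d} \Bigr) p > 0 . \qquad ■
\]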
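The exponent bookkeeping in the proof of Theorem 8 can be reproduced with exact fractions. The sketch below assumes the parameter choices $\lambda_n \sim n^{-1}$, $\gamma_n \sim n^{-1/(2\alpha+d)}$ and $r_n \sim n^{-1/(\beta d)}$ and ignores all constants; the function names and the sample values of $\alpha$, $d$, $\beta$, $p$ are hypothetical.

```python
from fractions import Fraction

def rate_exponents(alpha, d, beta, p):
    """Exponents of n in the three n-dependent terms of the oracle inequality."""
    lam = Fraction(-1)                   # lambda_n ~ n^{-1}
    gam = Fraction(-1, 2 * alpha + d)    # gamma_n ~ n^{-1/(2 alpha + d)}
    r = Fraction(-1, beta * d)           # r_n ~ n^{-1/(beta d)}
    term1 = lam - d * gam                # lambda_n * gamma_n^{-d}
    term2 = 2 * alpha * gam              # gamma_n^{2 alpha}
    # (r_n^{2-d} / lambda_n)^p * gamma_n^{-(d+2p)} * n^{-1}
    term3 = (2 - d) * p * r - p * lam - (d + 2 * p) * gam - 1
    return term1, term2, term3

def xi_lower_bound(alpha, d, beta, p):
    """(1 + 2/(2 alpha + d) + 1/beta - 2/(beta d)) * p."""
    return (1 + Fraction(2, 2 * alpha + d) + Fraction(1, beta) - Fraction(2, beta * d)) * p

# Hypothetical sample values.
alpha, d, beta, p = 2, 3, 2, Fraction(1, 10)
t1, t2, t3 = rate_exponents(alpha, d, beta, p)
xi = xi_lower_bound(alpha, d, beta, p)
```

For these values the first two terms both decay like $n^{-4/7}$, while the third term has exponent $-4/7 + 61/420$, which is the loss $\xi$ caused by $p > 0$.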

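Proof [Proof of Corollary 9] For simplicity of notation, we write $\lambda$, $\lambda_j$, $\gamma$, and $\gamma_j$ instead of $\lambda_n$, $\lambda_{n,j}$, $\gamma_n$, and $\gamma_{n,j}$. Since $\bigcup_{j \in J_T} A_j \subset T^{+\delta}$, the assumption $f^*_{L,P} \in B^\alpha_{2,\infty}(P_X|_{T^{+\delta}})$ implies

\[
f^*_{L,P} \in B^\alpha_{2,\infty}\bigl( P_X|_{\bigcup_{j \in J_T} A_j} \bigr) .
\]

With this, Theorems 7 and 8 immediately yield

\[
\begin{aligned}
\mathcal{R}_{L_T, P}(\wideparen{f}_{D, \lambda, \gamma}) - \mathcal{R}^*_{L_T, P}
&\le \sum_{j=1}^m \lambda_j \| f_{D_j, \lambda_j, \gamma_j} \|_{\hat H_{\gamma_j}(A_j)}^2 + \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \\
&\le C_{M,\alpha,p} \Bigl( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} + r^{2p} \Bigl( \sum_{j=1}^m \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + \tau n^{-1} \Bigr) \\
&\le C \tau\, n^{-\frac{2\alpha}{2\alpha + d} + \xi}
\end{aligned}
\]

with probability $P^n$ not less than $1 - e^{-\tau}$, where $\xi \ge \bigl( 1 + \frac{2}{2\alpha + d} + \frac{1}{\beta} - \frac{2}{\beta d} \bigr) p > 0$. Moreover, the constants $C_{M,\alpha,p} > 0$ and $C > 0$ coincide with those of Theorems 7 and 8. ■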


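It remains to prove Theorem 10. However, we first have to consider the following technical lemma.

Lemma 15 Let $d \ge 1$ and $r_n := c\, n^{-\frac{1}{\beta d}}$ with $\beta > 1$ and a constant $c > 0$. We fix finite subsets $\Lambda_n \subset (0, r_n^d]$ and $\Gamma_n \subset (0, r_n]$ such that $\Lambda_n$ is an $(r_n^d \varepsilon_n)$-net of $(0, r_n^d]$ and $\Gamma_n$ is a $\delta_n$-net of $(0, r_n]$ with $0 < \varepsilon_n \le n^{-1}$, $\delta_n > 0$, $r_n^d \in \Lambda_n$, and $r_n \in \Gamma_n$. Moreover, let $J \subset \{1, \dots, m_n\}$ be an arbitrary non-empty index set and $|J| \le m_n \le 8^d r_n^{-d}$. Then, for all $0 < \alpha < \frac{\beta - 1}{2} d$, $n \ge 1$, and all $p \in (0,1)$ with $p \le \frac{(\beta - 1) d - 2\alpha}{2\alpha + d + 2}$, we have

\[
\inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \Bigl( \sum_{j \in J} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J} \gamma_j}{\min_{j \in J} \gamma_j} \Bigr)^d \max_{j \in J} \gamma_j^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} \Bigr)
\le C \bigl( n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \delta_n^{2\alpha} \bigr) ,
\]

where $\xi := \Bigl( \frac{2\alpha (2\alpha + d + 2)}{(2\alpha + d)((2\alpha + d)(1 + p) + 2p)} + \max\bigl\{ \frac{d - 2}{\beta d}, 0 \bigr\} \Bigr) p$ and $C > 0$ is a constant independent of $n$, $\Lambda_n$, $\varepsilon_n$, $\Gamma_n$, and $\delta_n$.

Proof Without loss of generality, we may assume that $\Lambda_n$ and $\Gamma_n$ are of the form $\Lambda_n = \{\lambda^{(1)}, \dots, \lambda^{(u)}\}$ and $\Gamma_n = \{\gamma^{(1)}, \dots, \gamma^{(v)}\}$ with $\lambda^{(u)} = r_n^d$ and $\gamma^{(v)} = r_n$ as well as $\lambda^{(i-1)} < \lambda^{(i)}$ and $\gamma^{(\ell-1)} < \gamma^{(\ell)}$ for all $i = 2, \dots, u$ and $\ell = 2, \dots, v$. With $\lambda^{(0)} := 0$ and $\gamma^{(0)} := 0$ it is easy to see that

\[
\lambda^{(i)} - \lambda^{(i-1)} \le 2 r_n^d \varepsilon_n \quad \text{and} \quad \gamma^{(\ell)} - \gamma^{(\ell-1)} \le 2 \delta_n \tag{34}
\]

hold for all $i = 1, \dots, u$ and $\ell = 1, \dots, v$. Furthermore, define $\lambda^* := n^{-\frac{2\alpha + d}{(2\alpha + d)(1 + p) + 2p}}$ and $\gamma^* := c\, n^{-\frac{1}{(2\alpha + d)(1 + p) + 2p}}$. Then there exist indices $i \in \{1, \dots, u\}$ and $\ell \in \{1, \dots, v\}$ with $\lambda^{(i-1)} \le r_n^d \lambda^* \le \lambda^{(i)}$ and $\gamma^{(\ell-1)} \le \gamma^* \le \gamma^{(\ell)}$. Together with (34), this yields

\[
r_n^d \lambda^* \le \lambda^{(i)} \le r_n^d \lambda^* + 2 r_n^d \varepsilon_n \quad \text{and} \quad \gamma^* \le \gamma^{(\ell)} \le \gamma^* + 2 \delta_n . \tag{35}
\]

Moreover, the definition of $\lambda^*$ implies $\varepsilon_n \le \lambda^*$ and the one of $\gamma^*$ implies $\gamma^* \le r_n$ for $\alpha < \frac{\beta - 1}{2} d$ and $p \in (0, p^*]$, where $p^* := \frac{(\beta - 1) d - 2\alpha}{2\alpha + d + 2}$. Additionally, it is easy to check that

\[
\lambda^* (\gamma^*)^{-d} + (\gamma^*)^{2\alpha} + (\lambda^*)^{-p} (\gamma^*)^{-(d + 2p)} r_n^{(2 - d) p} n^{-1} \le \hat c\, n^{-\frac{2\alpha}{(2\alpha + d)(1 + p) + 2p} + \max\{ \frac{d - 2}{\beta d}, 0 \} p} , \tag{36}
\]

where $\hat c$ is a positive constant. Using (35), the bound $|J| \le m_n \le 8^d r_n^{-d}$, and (36), we obtain

\[
\begin{aligned}
&\inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \Bigl( \sum_{j \in J} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J} \gamma_j}{\min_{j \in J} \gamma_j} \Bigr)^d \max_{j \in J} \gamma_j^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} \Bigr) \\
&\quad \le \sum_{j \in J} \lambda^{(i)} (\gamma^{(\ell)})^{-d} + (\gamma^{(\ell)})^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} (\lambda^{(i)})^{-1} (\gamma^{(\ell)})^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} \\
&\quad \le |J|\, \lambda^{(i)} (\gamma^{(\ell)})^{-d} + (\gamma^{(\ell)})^{2\alpha} + (\lambda^{(i)})^{-p} (\gamma^{(\ell)})^{-(d + 2p)} r_n^{2p} n^{-1} \\
&\quad \le |J| \bigl( r_n^d \lambda^* + 2 r_n^d \varepsilon_n \bigr) (\gamma^*)^{-d} + (\gamma^* + 2 \delta_n)^{2\alpha} + (r_n^d \lambda^*)^{-p} (\gamma^*)^{-(d + 2p)} r_n^{2p} n^{-1} \\
&\quad \le 8^d \cdot 3 \lambda^* (\gamma^*)^{-d} + (\gamma^* + 2 \delta_n)^{2\alpha} + (\lambda^*)^{-p} (\gamma^*)^{-(d + 2p)} r_n^{(2 - d) p} n^{-1} \\
&\quad \le \tilde c \bigl( \lambda^* (\gamma^*)^{-d} + (\gamma^*)^{2\alpha} + (\lambda^*)^{-p} (\gamma^*)^{-(d + 2p)} r_n^{(2 - d) p} n^{-1} \bigr) + \tilde c\, \delta_n^{2\alpha} \\
&\quad \le \tilde c\, \hat c\, n^{-\frac{2\alpha}{(2\alpha + d)(1 + p) + 2p} + \max\{ \frac{d - 2}{\beta d}, 0 \} p} + \tilde c\, \delta_n^{2\alpha} \\
&\quad \le C \bigl( n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \delta_n^{2\alpha} \bigr)
\end{aligned}
\]

with $\xi := \Bigl( \frac{2\alpha (2\alpha + d + 2)}{(2\alpha + d)((2\alpha + d)(1 + p) + 2p)} + \max\bigl\{ \frac{d - 2}{\beta d}, 0 \bigr\} \Bigr) p$ and constants $\tilde c > 0$ and $C > 0$ independent of $n$, $\Lambda_n$, $\varepsilon_n$, $\Gamma_n$, and $\delta_n$. ■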
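The candidate sets $\Lambda_n$ and $\Gamma_n$ of Lemma 15 only need to be finite nets that contain the right end point of the interval. One simple way to obtain such a net is sketched below; the equidistant construction and the function names are our own illustrative choice and are not prescribed by the lemma. The sketch also checks the gap property (34).

```python
def build_net(upper, eps):
    """Finite eps-net of the interval (0, upper] that contains `upper`."""
    points = []
    x = eps
    while x < upper:
        points.append(x)      # x covers [x - eps, x + eps]
        x += 2 * eps
    points.append(upper)      # Lemma 15 requires the right end point in the net
    return points

def max_gap(points):
    """Largest gap between consecutive net points, with 0 prepended as in (34)."""
    extended = [0.0] + points
    return max(b - a for a, b in zip(extended, extended[1:]))

def covering_error(points, upper, samples=1000):
    """Largest distance from sample points of (0, upper] to the net."""
    ys = [upper * (k + 1) / samples for k in range(samples)]
    return max(min(abs(y - x) for x in points) for y in ys)

# Hypothetical example: a 0.03-net of (0, 0.5].
net = build_net(0.5, 0.03)
```

For instance, a $\delta_n$-net of $(0, r_n]$ would be `build_net(r_n, delta_n)`, and an $(r_n^d \varepsilon_n)$-net of $(0, r_n^d]$ would be `build_net(r_n ** d, r_n ** d * eps_n)`.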


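In the end, we show Theorem 10 using Theorem 7 as well as Lemma 15.

Proof [Proof of Theorem 10] Let $l$ be defined by $l := \lfloor \frac n2 \rfloor + 1$, i.e. $l \ge \frac n2$. With this, Theorem 7 yields with probability $P^l$ not less than $1 - |\Lambda_n \times \Gamma_n|^{m_n} e^{-\tau}$ that

\[
\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P}
\le c_1 \Bigl( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + \tau n^{-1} \Bigr) \tag{37}
\]

for all $(\lambda_j, \gamma_j) \in \Lambda_n \times \Gamma_n$, $j \in \{1, \dots, m_n\}$, simultaneously, where $c_1 > 0$ is a constant independent of $n$, $\tau$, $\lambda$, and $\gamma$. Furthermore, the oracle inequality of (Steinwart and Christmann, 2008a, Theorem 7.2) for empirical risk minimization, $n - l \ge \frac n2 - 1 \ge \frac n4$, and $\tau_n := \tau + \ln(1 + |\Lambda_n \times \Gamma_n|^{m_n})$ yield

\[
\begin{aligned}
\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda_{D_2}, \gamma_{D_2}}) - \mathcal{R}^*_{L_{J_T}, P}
&< 6 \Bigl( \inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \Bigr) + 512 M^2 \frac{\tau_n}{n - l} && (38) \\
&\le 6 \Bigl( \inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \Bigr) + 2048 M^2 \frac{\tau_n}{n} && (39)
\end{aligned}
\]

with probability $P^{n-l}$ not less than $1 - e^{-\tau}$. With (37), (39) and Lemma 15 we can conclude

\[
\begin{aligned}
&\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda_{D_2}, \gamma_{D_2}}) - \mathcal{R}^*_{L_{J_T}, P} \\
&\quad \le 6 \Bigl( \inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda, \gamma}) - \mathcal{R}^*_{L_{J_T}, P} \Bigr) + 2048 M^2 \frac{\tau_n}{n} \\
&\quad \le 6 c_1 \inf_{(\lambda_j, \gamma_j)_{j=1}^{m_n} \in (\Lambda_n \times \Gamma_n)^{m_n}} \Bigl( \sum_{j \in J_T} \lambda_j \gamma_j^{-d} + \Bigl( \frac{\max_{j \in J_T} \gamma_j}{\min_{j \in J_T} \gamma_j} \Bigr)^d \max_{j \in J_T} \gamma_j^{2\alpha} + r_n^{2p} \Bigl( \sum_{j=1}^{m_n} \lambda_j^{-1} \gamma_j^{-\frac{d+2p}{p}} P_X(A_j) \Bigr)^p n^{-1} + \tau n^{-1} \Bigr) + 2048 M^2 \frac{\tau_n}{n} \\
&\quad \le 6 c_1 \Bigl( C \bigl( n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \delta_n^{2\alpha} \bigr) + \tau n^{-1} \Bigr) + 2048 M^2 \frac{\tau_n}{n} \\
&\quad \le 12 c_1 C\, n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \bigl( 6 c_1 \tau + 2048 M^2 \tau_n \bigr) n^{-1}
\end{aligned}
\]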

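with probability $P^n$ not less than $1 - (1 + |\Lambda_n \times \Gamma_n|^{m_n}) e^{-\tau}$. Finally, a variable transformation yields

\[
\begin{aligned}
&\mathcal{R}_{L_{J_T}, P}(\wideparen{f}_{D_1, \lambda_{D_2}, \gamma_{D_2}}) - \mathcal{R}^*_{L_{J_T}, P} \\
&\quad \le 12 c_1 C\, n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \Bigl( 6 c_1 \bigl( \tau + \ln(1 + |\Lambda_n \times \Gamma_n|^{m_n}) \bigr) + 2048 M^2 \bigl( \tau + 2 \ln(1 + |\Lambda_n \times \Gamma_n|^{m_n}) \bigr) \Bigr) n^{-1} \\
&\quad \le 12 c_1 C\, n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \bigl( 6 c_1 + 2048 M^2 \bigr) \bigl( \tau + 2 m_n \ln(1 + |\Lambda_n \times \Gamma_n|) \bigr) n^{-1} \\
&\quad \le 12 c_1 C\, n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \bigl( 6 c_1 + 2048 M^2 \bigr) \bigl( \tau + 2 \cdot 8^d r_n^{-d} \ln(1 + |\Lambda_n \times \Gamma_n|) \bigr) n^{-1} \\
&\quad = 12 c_1 C\, n^{-\frac{2\alpha}{2\alpha + d} + \xi} + \bigl( 6 c_1 + 2048 M^2 \bigr) \Bigl( \tau n^{-1} + 2 \cdot 8^d c^{-d} \ln(1 + |\Lambda_n \times \Gamma_n|)\, n^{-\frac{\beta - 1}{\beta}} \Bigr) \\
&\quad \le \Bigl( 12 c_1 C + \bigl( 6 c_1 + 2048 M^2 \bigr) \bigl( \tau + 2 \cdot 8^d c^{-d} \ln(1 + |\Lambda_n \times \Gamma_n|) \bigr) \Bigr) n^{-\frac{2\alpha}{2\alpha + d} + \xi}
\end{aligned}
\]

with probability $P^n$ not less than $1 - e^{-\tau}$, where we used

\[
\alpha < \frac{\beta - 1}{2} d \quad \Longrightarrow \quad n^{-\frac{\beta - 1}{\beta}} \le n^{-\frac{2\alpha}{2\alpha + d}}
\]

in the last step.

Appendix A

For the sake of completeness, we present in the following some tables containing the computational results achieved by the LS-, VP-, and RC-SVMs for all real and artificial data set types. Here, the training and test times, given in seconds, are averaged over all successful runs. Moreover, for the test and $L_2$-errors, we also state the mean of all runs plus/minus the standard deviation. The same is true for the number of working sets (# of ws), except for the LS-SVMs, where we always have one working set by construction. The last two columns contain median, minimum, and maximum of the working set sizes appearing during the various runs.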



data set sizes | runs | train time | test time | test error | # of ws | ws size: median | ws size: range

LS-SVM
1000 (10000) | 100 | 1.38 | 0.48 | 0.7142 ±0.0097 | 1.00 | |
2500 (10000) | 100 | 6.42 | 0.99 | 0.6323 ±0.0074 | 1.00 | |
5000 (10000) | 100 | 30.19 | 1.55 | 0.5707 ±0.0070 | 1.00 | |
10000 (50000) | 10 | 138.01 | 12.14 | 0.4909 ±0.0089 | 1.00 | |
25000 (50000) | 10 | 922.90 | 34.19 | 0.3816 ±0.0042 | 1.00 | |
50000 (50000) | 3 | 3788 | 176.68 | 0.3117 ±0.0012 | 1.00 | |
100000 (50000) | 3 | 16353 | 507.38 | 0.2417 ±0.0102 | 1.00 | |

VP-SVM (block 1)
1000 (10000) | 100 | 5.85 | 0.73 | 0.7521 ±0.0108 | 83.73 ±1.86 | 6 | [1, 156]
2500 (10000) | 100 | 7.44 | 1.15 | 0.6602 ±0.0082 | 101.92 ±2.04 | 10 | [1, 481]
5000 (10000) | 100 | 9.37 | 1.83 | 0.6011 ±0.0079 | 106.72 ±1.95 | 24 | [1, 987]
10000 (50000) | 10 | 13.45 | 13.68 | 0.5238 ±0.0076 | 120.20 ±1.99 | 40 | [1, 952]
25000 (50000) | 10 | 26.73 | 31.07 | 0.4134 ±0.0057 | 131.00 ±2.26 | 80 | [1, 3204]
50000 (50000) | 3 | 57.41 | 183.71 | 0.3449 ±0.0047 | 139.67 ±4.16 | 155 | [3, 4334]
100000 (50000) | 3 | 171.12 | 493.44 | 0.2658 ±0.0028 | 154.33 ±5.51 | 271 | [3, 8121]
250000 (50000) | 3 | 1128 | 1169 | 0.1924 ±0.0016 | 166.33 ±3.79 | 533 | [3, 22633]
500000 (50000) | 3 | 5349 | 3020 | 0.1608 ±0.0024 | 178.33 ±0.58 | 987 | [2, 44585]

VP-SVM (block 2)
1000 (10000) | 100 | 2.79 | 0.59 | 0.7262 ±0.0092 | 37.39 ±1.85 | 15 | [1, 193]
2500 (10000) | 100 | 3.97 | 1.09 | 0.6379 ±0.0072 | 44.40 ±1.84 | 25 | [1, 493]
5000 (10000) | 100 | 5.58 | 1.74 | 0.5845 ±0.0062 | 47.59 ±1.66 | 50 | [1, 1008]
10000 (50000) | 10 | 10.64 | 13.43 | 0.5075 ±0.0042 | 51.20 ±1.62 | 85 | [1, 1918]
25000 (50000) | 10 | 31.83 | 32.38 | 0.4026 ±0.0044 | 58.60 ±1.65 | 142 | [3, 4971]
50000 (50000) | 3 | 120.10 | 208.34 | 0.3256 ±0.0021 | 61.33 ±2.08 | 274 | [4, 10005]
100000 (50000) | 3 | 444.39 | 523.39 | 0.2539 ±0.0034 | 64.67 ±1.53 | 484 | [18, 20082]
250000 (50000) | 3 | 2825 | 1274 | 0.1859 ±0.0018 | 68.33 ±1.15 | 1169 | [21, 49558]
500000 (50000) | 3 | 12882 | 3786 | 0.1540 ±0.0010 | 72.00 ±1.00 | 2366 | [54, 74787]

VP-SVM (block 3)
1000 (10000) | 100 | 0.87 | 0.53 | 0.7095 ±0.0102 | 4.40 ±0.49 | 165 | [42, 489]
2500 (10000) | 100 | 2.14 | 1.05 | 0.6317 ±0.0062 | 5.02 ±0.49 | 382 | [83, 1248]
5000 (10000) | 100 | 7.41 | 1.74 | 0.5737 ±0.0073 | 4.94 ±0.55 | 723 | [229, 2401]
10000 (50000) | 10 | 28.58 | 13.62 | 0.5007 ±0.0070 | 5.20 ±0.42 | 1518 | [488, 4514]
25000 (50000) | 10 | 201.17 | 33.16 | 0.3937 ±0.0049 | 5.70 ±0.67 | 2533 | [566, 11113]
50000 (50000) | 3 | 827.41 | 218.85 | 0.3225 ±0.0028 | 5.33 ±0.58 | 7530 | [2870, 24509]
100000 (50000) | 3 | 2823 | 548.96 | 0.2418 ±0.0007 | 7.00 ±0.00 | 6356 | [2621, 40018]
250000 (50000) | 3 | 20010 | 1434 | 0.1689 ±0.0041 | 6.33 ±0.58 | 17721 | [6061, 110464]
500000 (50000) | 0 (3) | NA | NA | NA ±NA | 7.33 ±1.53 | 68182 | [12469, 233675]

VP-SVM (block 4)
1000 (10000) | 100 | 1.33 | 0.48 | 0.7138 ±0.0100 | 1.06 ±0.24 | 1000 | [208, 1000]
2500 (10000) | 100 | 5.97 | 1.01 | 0.6326 ±0.0080 | 1.16 ±0.37 | 2500 | [375, 2500]
5000 (10000) | 100 | 29.20 | 1.58 | 0.5705 ±0.0071 | 1.06 ±0.24 | 5000 | [1810, 5000]
10000 (50000) | 10 | 131.83 | 12.48 | 0.4895 ±0.0066 | 1.10 ±0.32 | 10000 | [4338, 10000]
25000 (50000) | 10 | 832.08 | 33.09 | 0.3830 ±0.0045 | 1.20 ±0.42 | 25000 | [8524, 25000]
50000 (50000) | 3 | 3182 | 185.13 | 0.3151 ±0.0066 | 1.33 ±0.58 | 40062 | [19875, 50000]
100000 (50000) | 3 | 10472 | 527.18 | 0.2427 ±0.0030 | 1.67 ±0.58 | 55873 | [42218, 100000]
250000 (50000) | 1 (3) | 34449 | 1445 | 0.1650 ±0.0000 | 3.00 ±0.00 | 86655 | [46661, 116684]
500000 (50000) | 0 (3) | NA | NA | NA ±NA | 1.33 ±0.58 | 375000 | [186510, 500000]

Table 3: LS- and VP-SVM results relating to the COVTYPE data sets
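To read the trade-off between training time and test error off Table 3, it helps to compute ratios for a single data set size. The following sketch hard-codes the Table 3 rows for 100000 training samples of the LS-SVM and of the VP-SVM block with the largest number of working sets; the dictionary layout and function names are our own.

```python
# Rows for 100000 training samples, taken from Table 3 (times in seconds).
ls_svm = {"train_time": 16353.0, "test_error": 0.2417}
vp_svm = {"train_time": 171.12, "test_error": 0.2658}  # VP-SVM block with about 154 working sets

def speedup(baseline, candidate):
    """Factor by which the candidate trains faster than the baseline."""
    return baseline["train_time"] / candidate["train_time"]

def error_increase(baseline, candidate):
    """Absolute increase of the test error of the candidate over the baseline."""
    return candidate["test_error"] - baseline["test_error"]

factor = speedup(ls_svm, vp_svm)
delta = error_increase(ls_svm, vp_svm)
```

For this row the VP-SVM trains roughly 96 times faster, while its test error is larger by about 0.024.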


data set sizes | runs | train time | test time | test error | # of ws | ws size: median | ws size: range

RC-SVM, # of ws = 1
1000 (10000) | 100 | 1.76 | 0.30 | 0.7159 ±0.0101 | 1.00 ±0.00 | 1000 | {1000}
2500 (10000) | 100 | 22.55 | 2.19 | 0.6324 ±0.0073 | 1.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 97.30 | 9.16 | 0.5707 ±0.0070 | 1.00 ±0.00 | 5000 | {5000}
10000 (50000) | 10 | 741.46 | 111.75 | 0.4909 ±0.0089 | 1.00 ±0.00 | 10000 | {10000}
25000 (50000) | 10 | 4629 | 291.75 | 0.3816 ±0.0042 | 1.00 ±0.00 | 25000 | {25000}
50000 (50000) | 3 | 11416 | 521.84 | 0.3117 ±0.0012 | 1.00 ±0.00 | 50000 | {50000}
100000 (50000) | 3 | 19921 | 622.99 | 0.2417 ±0.0102 | 1.00 ±0.00 | 100000 | {100000}
250000 (50000) | 0 | NA | NA | NA ±NA | NA ±NA | NA | {NA}
500000 (50000) | 0 | NA | NA | NA ±NA | NA ±NA | NA | {NA}

RC-SVM, # of ws = 4
1000 (10000) | 100 | 0.59 | 0.29 | 0.7527 ±0.0143 | 4.00 ±0.00 | 250 | {250}
2500 (10000) | 100 | 4.76 | 2.56 | 0.6676 ±0.0077 | 4.00 ±0.00 | 625 | {625}
5000 (10000) | 100 | 25.57 | 11.60 | 0.6254 ±0.0074 | 4.00 ±0.00 | 1250 | {1250}
10000 (50000) | 10 | 89.55 | 83.44 | 0.5649 ±0.0047 | 4.00 ±0.00 | 2500 | {2500}
25000 (50000) | 10 | 905.89 | 243.32 | 0.4746 ±0.0062 | 4.00 ±0.00 | 6250 | {6250}
50000 (50000) | 3 | 2160 | 529.96 | 0.3990 ±0.0032 | 4.00 ±0.00 | 12500 | {12500}
100000 (50000) | 3 | 4191 | 826.40 | 0.3224 ±0.0016 | 4.00 ±0.00 | 25000 | {25000}
250000 (50000) | 3 | 29909 | 1677 | 0.2251 ±0.0012 | 4.00 ±0.00 | 62500 | {62500}
500000 (50000) | 0 (3) | NA | NA | NA ±NA | 4.00 ±0.00 | 125000 | {125000}

RC-SVM, # of ws = 5
1000 (10000) | 100 | 0.69 | 0.30 | 0.7674 ±0.0142 | 5.00 ±0.00 | 200 | {200}
2500 (10000) | 100 | 3.59 | 2.97 | 0.6798 ±0.0092 | 5.00 ±0.00 | 500 | {500}
5000 (10000) | 100 | 18.79 | 14.25 | 0.6404 ±0.0082 | 5.00 ±0.00 | 1000 | {1000}
10000 (50000) | 10 | 72.89 | 112.97 | 0.5790 ±0.0070 | 5.00 ±0.00 | 2000 | {2000}
25000 (50000) | 10 | 659.31 | 297.07 | 0.4886 ±0.0048 | 5.00 ±0.00 | 5000 | {5000}
50000 (50000) | 3 | 1620 | 470.55 | 0.4216 ±0.0024 | 5.00 ±0.00 | 10000 | {10000}
100000 (50000) | 3 | 5441 | 1298 | 0.3420 ±0.0040 | 5.00 ±0.00 | 20000 | {20000}
250000 (50000) | 3 | 20828 | 1614 | 0.2395 ±0.0015 | 5.00 ±0.00 | 50000 | {50000}
500000 (50000) | 3 | 81660 | 2794 | 0.1931 ±0.0007 | 5.00 ±0.00 | 100000 | {100000}

RC-SVM, # of ws = 6
1000 (10000) | 100 | 0.73 | 0.29 | 0.7776 ±0.0138 | 6.00 ±0.00 | 167 | [166, 167]
2500 (10000) | 100 | 3.49 | 3.27 | 0.6911 ±0.0101 | 6.00 ±0.00 | 417 | [416, 417]
5000 (10000) | 100 | 15.61 | 16.57 | 0.6519 ±0.0089 | 6.00 ±0.00 | 833 | [833, 834]
10000 (50000) | 10 | 50.90 | 72.44 | 0.5937 ±0.0060 | 6.00 ±0.00 | 1667 | [1666, 1667]
25000 (50000) | 10 | 230.76 | 259.58 | 0.5035 ±0.0055 | 6.00 ±0.00 | 4167 | [4166, 4167]
50000 (50000) | 3 | 1406 | 425.29 | 0.4355 ±0.0028 | 6.00 ±0.00 | 8333 | [8333, 8334]
100000 (50000) | 3 | 5139 | 1203 | 0.3579 ±0.0047 | 6.00 ±0.00 | 16667 | [16666, 16667]
250000 (50000) | 3 | 17099 | 1577 | 0.2542 ±0.0027 | 6.00 ±0.00 | 41667 | [41666, 41667]
500000 (50000) | 3 | 66335 | 2534 | 0.2013 ±0.0004 | 6.00 ±0.00 | 83333 | [83333, 83334]

RC-SVM, # of ws = 7
1000 (10000) | 100 | 0.72 | 0.30 | 0.7843 ±0.0146 | 7.00 ±0.00 | 143 | [142, 143]
2500 (10000) | 100 | 3.76 | 6.22 | 0.6991 ±0.0090 | 7.00 ±0.00 | 357 | [357, 358]
5000 (10000) | 100 | 9.51 | 10.49 | 0.6608 ±0.0072 | 7.00 ±0.00 | 714 | [714, 715]
10000 (50000) | 10 | 51.30 | 75.81 | 0.6045 ±0.0078 | 7.00 ±0.00 | 1429 | [1428, 1429]
25000 (50000) | 10 | 258.70 | 255.08 | 0.5163 ±0.0057 | 7.00 ±0.00 | 3571 | [3571, 3572]
50000 (50000) | 3 | 1087 | 440.26 | 0.4463 ±0.0026 | 7.00 ±0.00 | 7143 | [7142, 7143]
100000 (50000) | 3 | 3174 | 861.05 | 0.3765 ±0.0033 | 7.00 ±0.00 | 14286 | [14285, 14286]
250000 (50000) | 3 | 15472 | 1480 | 0.2638 ±0.0012 | 7.00 ±0.00 | 35714 | [35714, 35715]
500000 (50000) | 3 | 56601 | 2628 | 0.2094 ±0.0013 | 7.00 ±0.00 | 71429 | [71428, 71429]

RC-SVM, # of ws = 10
1000 (10000) | 100 | 0.99 | 0.32 | 0.8134 ±0.0133 | 10.00 ±0.00 | 100 | {100}
2500 (10000) | 100 | 2.73 | 5.29 | 0.7188 ±0.0090 | 10.00 ±0.00 | 250 | {250}
5000 (10000) | 100 | 6.61 | 15.24 | 0.6825 ±0.0074 | 10.00 ±0.00 | 500 | {500}
10000 (50000) | 10 | 31.94 | 69.23 | 0.6278 ±0.0079 | 10.00 ±0.00 | 1000 | {1000}
25000 (50000) | 10 | 259.18 | 304.44 | 0.5444 ±0.0041 | 10.00 ±0.00 | 2500 | {2500}
50000 (50000) | 3 | 867.90 | 645.72 | 0.4904 ±0.0044 | 10.00 ±0.00 | 5000 | {5000}
100000 (50000) | 3 | 2065 | 1204 | 0.4120 ±0.0076 | 10.00 ±0.00 | 10000 | {10000}
250000 (50000) | 3 | 9534 | 1398 | 0.3039 ±0.0017 | 10.00 ±0.00 | 25000 | {25000}
500000 (50000) | 3 | 43146 | 2689 | 0.2300 ±0.0008 | 10.00 ±0.00 | 50000 | {50000}

RC-SVM, # of ws = 20
1000 (10000) | 100 | 1.67 | 0.33 | 0.8646 ±0.0082 | 20.00 ±0.00 | 50 | {50}
2500 (10000) | 100 | 3.16 | 4.43 | 0.7574 ±0.0076 | 20.00 ±0.00 | 125 | {125}
5000 (10000) | 100 | 3.99 | 13.46 | 0.7172 ±0.0069 | 20.00 ±0.00 | 250 | {250}
10000 (50000) | 10 | 22.55 | 170.42 | 0.6770 ±0.0068 | 20.00 ±0.00 | 500 | {500}
25000 (50000) | 10 | 87.06 | 236.67 | 0.5958 ±0.0054 | 20.00 ±0.00 | 1250 | {1250}
50000 (50000) | 3 | 150.59 | 269.68 | 0.5470 ±0.0029 | 20.00 ±0.00 | 2500 | {2500}
100000 (50000) | 3 | 885.28 | 685.50 | 0.4760 ±0.0028 | 20.00 ±0.00 | 5000 | {5000}
250000 (50000) | 3 | 4559 | 1634 | 0.3752 ±0.0057 | 20.00 ±0.00 | 12500 | {12500}
500000 (50000) | 3 | 19033 | 2802 | 0.2971 ±0.0008 | 20.00 ±0.00 | 25000 | {25000}

RC-SVM, # of ws = 50
1000 (10000) | 100 | 3.75 | 0.38 | 0.9280 ±0.0046 | 50.00 ±0.00 | 20 | {20}
2500 (10000) | 100 | 6.58 | 5.52 | 0.8220 ±0.0073 | 50.00 ±0.00 | 50 | {50}
5000 (10000) | 100 | 6.04 | 15.28 | 0.7679 ±0.0081 | 50.00 ±0.00 | 100 | {100}
10000 (50000) | 10 | 11.50 | 124.14 | 0.7245 ±0.0058 | 50.00 ±0.00 | 200 | {200}
25000 (50000) | 10 | 19.95 | 277.44 | 0.6551 ±0.0056 | 50.00 ±0.00 | 500 | {500}
50000 (50000) | 3 | 54.19 | 221.31 | 0.6101 ±0.0032 | 50.00 ±0.00 | 1000 | {1000}
100000 (50000) | 3 | 319.38 | 1018 | 0.5600 ±0.0023 | 50.00 ±0.00 | 2000 | {2000}
250000 (50000) | 3 | 2680 | 1659 | 0.4688 ±0.0037 | 50.00 ±0.00 | 5000 | {5000}
500000 (50000) | 3 | 6984 | 2352 | 0.4011 ±0.0022 | 50.00 ±0.00 | 10000 | {10000}

RC-SVM, # of ws = 100
1000 (10000) | 100 | 7.14 | 0.42 | 0.9562 ±0.0035 | 100.00 ±0.00 | 10 | {10}
2500 (10000) | 100 | 12.05 | 5.85 | 0.8752 ±0.0049 | 100.00 ±0.00 | 25 | {25}
5000 (10000) | 100 | 10.70 | 16.60 | 0.8239 ±0.0061 | 100.00 ±0.00 | 50 | {50}
10000 (50000) | 10 | 12.05 | 92.09 | 0.7593 ±0.0069 | 100.00 ±0.00 | 100 | {100}
25000 (50000) | 10 | 21.60 | 276.46 | 0.6926 ±0.0037 | 100.00 ±0.00 | 250 | {250}
50000 (50000) | 3 | 32.39 | 206.27 | 0.6493 ±0.0033 | 100.00 ±0.00 | 500 | {500}
100000 (50000) | 3 | 141.39 | 798.77 | 0.6082 ±0.0022 | 100.00 ±0.00 | 1000 | {1000}
250000 (50000) | 3 | 1325 | 1970 | 0.5306 ±0.0029 | 100.00 ±0.00 | 2500 | {2500}
500000 (50000) | 3 | 3053 | 2315 | 0.4670 ±0.0025 | 100.00 ±0.00 | 5000 | {5000}

RC-SVM, # of ws = 150
1000 (10000) | 100 | 11.98 | 0.63 | 0.9650 ±0.0023 | 150.00 ±0.00 | 7 | [6, 7]
2500 (10000) | 100 | 15.21 | 4.66 | 0.9086 ±0.0040 | 150.00 ±0.00 | 17 | [16, 17]
5000 (10000) | 100 | 16.94 | 14.18 | 0.8610 ±0.0053 | 150.00 ±0.00 | 33 | [33, 34]
10000 (50000) | 10 | 18.06 | 142.93 | 0.7936 ±0.0068 | 150.00 ±0.00 | 67 | [66, 67]
25000 (50000) | 10 | 16.47 | 183.81 | 0.7141 ±0.0039 | 150.00 ±0.00 | 167 | [166, 167]
50000 (50000) | 3 | 26.25 | 187.73 | 0.6742 ±0.0008 | 150.00 ±0.00 | 333 | [333, 334]
100000 (50000) | 3 | 100.66 | 755.96 | 0.6337 ±0.0020 | 150.00 ±0.00 | 667 | [666, 667]
250000 (50000) | 3 | 537.02 | 1455 | 0.5658 ±0.0004 | 150.00 ±0.00 | 1667 | [1666, 1667]
500000 (50000) | 3 | 1859 | 2336 | 0.5081 ±0.0019 | 150.00 ±0.00 | 3333 | [3333, 3334]

Table 4: RC-SVM results relating to the COVTYPE data sets


Optimal Learning Rates for Localized SVMs 



Eberts and Steinwart 


data set sizes | runs | train time | test time | test error | # of ws | ws size: median | ws size: range

LS-SVM
1000 (10000) | 100 | 1.21 | 0.25 | 0.1916 ±0.0055 | 1.00 | |
2500 (10000) | 100 | 8.25 | 0.57 | 0.1842 ±0.0043 | 1.00 | |
5000 (10000) | 100 | 33.98 | 1.03 | 0.1672 ±0.0032 | 1.00 | |
10000 (50000) | 10 | 129.26 | 11.76 | 0.1596 ±0.0032 | 1.00 | |
25000 (50000) | 10 | 791.14 | 52.75 | 0.1453 ±0.0020 | 1.00 | |
50000 (50000) | 3 | 3029 | 169.78 | 0.1410 ±0.0022 | 1.00 | |
100000 (50000) | 3 | 11078 | 201.54 | 0.1302 ±0.0008 | 1.00 | |

VP-SVM (block 1)
1000 (10000) | 100 | 1.81 | 0.46 | 0.2586 ±0.0090 | 21.62 ±1.23 | 15 | [1, 469]
2500 (10000) | 100 | 2.90 | 0.90 | 0.2243 ±0.0058 | 25.03 ±1.58 | 28 | [1, 1289]
5000 (10000) | 100 | 5.19 | 1.34 | 0.1891 ±0.0036 | 26.50 ±1.63 | 62 | [2, 2520]
10000 (50000) | 10 | 12.27 | 9.21 | 0.1724 ±0.0033 | 28.50 ±1.51 | 84 | [1, 5018]
25000 (50000) | 10 | 43.56 | 21.51 | 0.1522 ±0.0024 | 34.10 ±1.52 | 209 | [1, 6015]
50000 (50000) | 3 | 245.63 | 169.75 | 0.1462 ±0.0027 | 39.67 ±2.31 | 275 | [3, 18273]
100000 (50000) | 3 | 932.30 | 347.92 | 0.1352 ±0.0024 | 41.00 ±1.00 | 535 | [6, 30230]
250000 (50000) | 3 | 9404 | 869.18 | 0.1277 ±0.0014 | 44.67 ±0.58 | 1120 | [7, 112012]
400000 (50000) | 3 | 17110 | 1437 | 0.1152 ±0.0014 | 44.67 ±1.15 | 1804 | [10, 116673]

VP-SVM (block 2)
1000 (10000) | 100 | 1.02 | 0.40 | 0.2056 ±0.0070 | 4.32 ±0.72 | 86 | [5, 956]
2500 (10000) | 100 | 2.69 | 0.81 | 0.1953 ±0.0050 | 4.84 ±0.72 | 200 | [8, 2263]
5000 (10000) | 100 | 10.51 | 1.25 | 0.1743 ±0.0040 | 5.06 ±0.81 | 304 | [3, 4670]
10000 (50000) | 10 | 45.12 | 8.41 | 0.1636 ±0.0022 | 4.70 ±0.48 | 583 | [63, 8697]
25000 (50000) | 10 | 264.42 | 34.60 | 0.1466 ±0.0038 | 5.70 ±0.67 | 1360 | [51, 19740]
50000 (50000) | 3 | 1268 | 186.33 | 0.1445 ±0.0008 | 6.33 ±1.15 | 1663 | [180, 42173]
100000 (50000) | 3 | 3670 | 425.33 | 0.1279 ±0.0016 | 6.33 ±0.58 | 7167 | [359, 65686]
250000 (50000) | 0 (3) | NA | NA | NA ±NA | 7.33 ±2.08 | 34091 | [547, 210793]
400000 (50000) | 0 (3) | NA | NA | NA ±NA | 7.33 ±1.53 | 54545 | [738, 336202]

data set sizes | runs | train time | test time | test error | # of ws | ws size: median | ws size: range

RC-SVM, # of ws = 3
1000 (10000) | 100 | 0.87 | 0.39 | 0.2036 ±0.0050 | 3.00 ±0.00 | 333 | [333, 334]
2500 (10000) | 100 | 2.65 | 0.63 | 0.1925 ±0.0035 | 3.00 ±0.00 | 833 | [833, 834]
5000 (10000) | 100 | 9.58 | 0.99 | 0.1758 ±0.0027 | 3.00 ±0.00 | 1667 | [1666, 1667]
10000 (50000) | 10 | 34.21 | 11.14 | 0.1660 ±0.0025 | 3.00 ±0.00 | 3333 | [3333, 3334]
25000 (50000) | 10 | 240.40 | 20.92 | 0.1506 ±0.0017 | 3.00 ±0.00 | 8333 | [8333, 8334]
50000 (50000) | 3 | 972.97 | 33.79 | 0.1487 ±0.0003 | 3.00 ±0.00 | 16667 | [16666, 16667]
100000 (50000) | 3 | 3732 | 178.60 | 0.1353 ±0.0038 | 3.00 ±0.00 | 33333 | [33333, 33334]
250000 (50000) | 3 | 21952 | 697.16 | 0.1251 ±0.0021 | 3.00 ±0.00 | 83333 | [83333, 83334]
400000 (50000) | 0 (3) | NA | NA | NA ±NA | 3.00 ±0.00 | 133333 | [133333, 133334]

RC-SVM, # of ws = 4
1000 (10000) | 100 | 0.82 | 0.40 | 0.2087 ±0.0050 | 4.00 ±0.00 | 250 | {250}
2500 (10000) | 100 | 1.92 | 0.57 | 0.1967 ±0.0032 | 4.00 ±0.00 | 625 | {625}
5000 (10000) | 100 | 6.82 | 0.93 | 0.1795 ±0.0028 | 4.00 ±0.00 | 1250 | {1250}
10000 (50000) | 10 | 23.97 | 9.69 | 0.1716 ±0.0020 | 4.00 ±0.00 | 2500 | {2500}
25000 (50000) | 10 | 172.79 | 21.55 | 0.1525 ±0.0015 | 4.00 ±0.00 | 6250 | {6250}
50000 (50000) | 3 | 723.49 | 36.48 | 0.1497 ±0.0005 | 4.00 ±0.00 | 12500 | {12500}
100000 (50000) | 3 | 2882 | 189.13 | 0.1373 ±0.0009 | 4.00 ±0.00 | 25000 | {25000}
250000 (50000) | 3 | 18055 | 807.67 | 0.1272 ±0.0010 | 4.00 ±0.00 | 62500 | {62500}
400000 (50000) | 1 (3) | 44182 | 1193 | 0.1225 ±0.0000 | 4.00 ±0.00 | 100000 | {100000}

RC-SVM, # of ws = 5
1000 (10000) | 100 | 0.84 | 0.42 | 0.2137 ±0.0044 | 5.00 ±0.00 | 200 | {200}
2500 (10000) | 100 | 1.65 | 0.50 | 0.2001 ±0.0030 | 5.00 ±0.00 | 500 | {500}
5000 (10000) | 100 | 5.58 | 1.17 | 0.1819 ±0.0027 | 5.00 ±0.00 | 1000 | {1000}
10000 (50000) | 10 | 20.48 | 11.74 | 0.1738 ±0.0026 | 5.00 ±0.00 | 2000 | {2000}
25000 (50000) | 10 | 132.96 | 22.39 | 0.1556 ±0.0015 | 5.00 ±0.00 | 5000 | {5000}
50000 (50000) | 3 | 582.94 | 41.03 | 0.1510 ±0.0005 | 5.00 ±0.00 | 10000 | {10000}
100000 (50000) | 3 | 2317 | 174.35 | 0.1397 ±0.0012 | 5.00 ±0.00 | 20000 | {20000}
250000 (50000) | 3 | 13655 | 684.71 | 0.1290 ±0.0013 | 5.00 ±0.00 | 50000 | {50000}
400000 (50000) | 3 | 35724 | 1360 | 0.1233 ±0.0003 | 5.00 ±0.00 | 80000 | {80000}


1000 

(10000) 

100 

0.86 

0.42 

0,2176 

±0.0050 

6.00 

±0.00 

167 

[166 , 167] 













2500 

(10000) 

100 

1.55 

0.50 

0,2036 

±0.0037 

6.00 

±0.00 

417 

[416,417] 



1000 (10000) 

100 

1.18 

0.27 

0.1932 ±0.0061 

1.41 ±0.57 

957 

[36 , 1000] 



5000 

(10000) 

100 

5.29 

1.26 

0,1849 

±0.0022 

6.00 

±0.00 

833 

[833 , 834 



2 500 (10000) 

100 

5.28 

0.59 

0.1875 ±0.0055 

1.60 ±0.53 

2371 

[55 , 2500] 

S 

II 

10000 

(50000) 

10 

19.15 

12,14 

0,1748 

±0.0010 

6.00 

±0.00 

1667 

[1666 , 1667] 



5 000 (10000) 

100 

24.31 

1.08 

0.1687 ±0.0035 

1.36 ±0.48 

4795 

[188 , 5000] 


P 

25000 

(50000) 

10 

106,73 

22,69 

0,1580 

±0.0016 

6.00 

±0.00 

4167 

[4166,4167] 

> 

II 

10000 (50000) 

10 

116,00 

8.76 

0.1592 ±0.0032 

1.10 ±0.32 

10000 

[615 , 10000] 

6 

Cd 

'S 

50000 

(50000) 

3 

478,62 

41.69 

0,1540 

±0.0013 

6.00 

±0.00 

8333 

[8333 , 8334] 


3 

25000 (50000) 

10 

580.23 

41.56 

0.1473 ±0.0029 

1.80 ±0.42 

21051 

[752 , 25000] 



100000 

(50000) 

3 

1946 

180,19 

0,1396 

±0.0018 

6.00 

±0.00 

16667 

[16666 , 16667] 

> 

1 

50000 (50000) 

3 

3130 

186.34 

0.1429 ±0.0015 

1.67 ±0.58 

42736 

[1861 , 50000] 



250000 

(50000) 

3 

12080 

690,60 

0,1315 

±0.0012 

6.00 

±0.00 

41667 

[41666 , 41667] 



100000 (50000) 

3 

8886 

371.73 

0.1303 ±0.0029 

2.33 ±0.58 

12300 

[2023 , 97977] 



400000 

(50000) 

3 

31533 

1298 

0,1251 

±0.0006 

6.00 

±0.00 

66667 

[66666 , 66667] 



250000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

2.33 ±0.58 

107143 

[2796 , 247204] 

— 

— 














400 000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

2.00 ±0.00 

200000 

[4773 , 395227] 



1000 

(10000) 

100 

1.02 

0.45 

0,2341 

±0.0048 

10.00 

±0.00 

100 

{100} 













2500 

(10000) 

100 

1.45 

0.51 

0.2136 

±0.0030 

10.00 

±0.00 

250 

{250} 



1000 (10000) 

100 

1.15 

0.23 

0.1913 ±0.0056 

1.00 ±0.00 

1000 

{1000} 


O 

5000 

(10000) 

100 

3.42 

1.30 

0,1928 

±0.0026 

10.00 

±0.00 

500 

{500} 



2 500 (10000) 

100 

6.45 

0.57 

0.1842 ±0.0043 

1.00 ±0.00 

2500 

{2500} 


II 

10000 

(50000) 

10 

12.03 

13,35 

0,1831 

±0.0028 

10.00 

±0.00 

1000 

{1000} 


Tf 

5 000 (10000) 

100 

28.18 

1.06 

0.1672 ±0.0032 

1.00 ±0.00 

5000 

{5000} 


P 

25000 

(50000) 

10 

58.63 

22,87 

0,1650 

±0.0013 

10.00 

±0.00 

2500 

{2500} 

> 

II 

10000 (50000) 

10 

119.95 

8.58 

0.1596 ±0.0032 

1.00 ±0.00 

10000 

{10000} 

§ 

'o 

50000 

(50000) 

3 

264.45 

47,21 

0,1613 

±0.0008 

10.00 

±0.00 

5000 

{5000} 


3 

25000 (50000) 

10 

807.92 

72.42 

0.1452 ±0.0020 

1.10 ±0.32 

25000 

[6292 , 25000] 



100000 

(50000) 

3 

1166 

190,86 

0,1450 

±0.0022 

10.00 

±0.00 

10000 

{10000} 

> 

1 

50000 (50000) 

3 

3539 

197.21 

0.1410 ±0.0022 

1.00 ±0.00 

50000 

{50000} 



250000 

(50000) 

3 

7119 

643,38 

0,1374 

±0.0012 

10.00 

±0.00 

25000 

{25000} 



100000 (50000) 

3 

12 679 

389.43 

0.1302 ±0.0008 

1.00 ±0.00 

100000 

{100000} 



400000 

(50000) 

3 

18495 

1317 

0,1306 

±0.0014 

10.00 

±0.00 

40000 

{40000} 



250000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

1.00 ±0.00 

250000 

{250000} 

— 

— 














400 000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

1.00 ±0.00 

400000 

{400000} 



1000 

(10000) 

100 

1.61 

0.50 

0,2736 

±0.0071 

20.00 

±0.00 

50 

{50} 













2500 

(10000) 

100 

1.76 

0.55 

0,2365 

±0.0038 

20.00 

±0.00 

125 

{125} 












o 

5000 

(10000) 

100 

2.84 

1.33 

0,2067 

±0.0024 

20.00 

±0.00 

250 

{250} 



data set sizes 

runs 

train time 

test time 

test error 

#ofws 

ws size; median 

ws size: range 

> 

II 

10000 

(50000) 

10 

6.79 

14.07 

0,1946 

±0.0026 

20.00 

±0.00 

500 

{500} 












p 

25 000 

(50000) 

10 

31.20 

23.20 

0.1730 

±0.0021 

20.00 

±0.00 

1250 

{1250} 



1000 (10000) 

100 

1.10 

0.23 

0.1912 ±0.0054 

1.00 ±0.00 

1000 

{1000} 

§ 

'S 

50000 

(50000) 

3 

115.38 

46,53 

0,1699 

±0.0003 

20.00 

±0.00 

2500 

{2500} 



2 500 (10000) 

100 

7.11 

0.53 

0.1842 ±0.0042 

1.00 ±0.00 

2500 

{2500} 



100000 

(50000) 

3 

524.48 

201,80 

0,1548 

±0.0009 

20.00 

±0.00 

5000 

{5000} 



5 000 (10000) 

100 

32.00 

0.87 

0.1672 ±0.0032 

1.00 ±0.00 

5000 

{5000} 



250000 

(50000) 

3 

3688 

616,18 

0,1418 

±0.0006 

20.00 

±0.00 

12500 

{12500} 

2 

> 


10000 (50000) 

10 

119.75 

11.36 

0.1595 ±0.0033 

1.00 ±0.00 

10000 

{10000} 



400000 

(50000) 

3 

9783 

1119 

0,1374 

±0.0003 

20.00 

±0.00 

20000 

{20000} 


P 

25000 (50000) 

10 

737.28 

24.31 

0.1453 ±0.0020 

1.00 ±0.00 

25000 

{25000} 

— 

— 













"o 

50000 (50000) 

3 

2688 

39.19 

0.1410 ±0.0022 

1.00 ±0.00 

50000 

{50000} 



1000 

(10000) 

100 

3.35 

0.51 

0,4617 

±0.0284 

50.00 

±0.00 

20 

{20} 


=#: 

100000 (50000) 

3 

10227 

163.67 

0.1302 ±0.0008 

1.00 ±0.00 

100000 

{100000} 



2500 

(10000) 

100 

3.31 

0.63 

0,2958 

±0.0079 

50.00 

±0.00 

50 

{50} 



250000 (50000) 

0 

NA 

NA 

NA ±NA 

NA ±NA 

NA 

{NA} 


s 

5000 

(10000) 

100 

4.79 

1.56 

0,2351 

±0.0034 

50.00 

±0.00 

100 

{100} 



400 000 (50000) 

0 

NA 

NA 

NA ±NA 

NA ±NA 

NA 

{NA} 

> 

II 

10000 

(50000) 

10 

5.77 

15,00 

0,2125 

±0.0014 

50.00 

±0.00 

200 

{200} 

— 

— 









cn 

p 

25000 

(50000) 

10 

25.22 

23,75 

0,1867 

±0.0013 

50.00 

±0.00 

500 

{500} 



1000 (10000) 

100 

1.06 

0.38 

0.1964 ±0.0049 

2.00 ±0.00 

500 

{500} 

§ 

'o 

50000 

(50000) 

3 

52.65 

43,21 

0,1812 

±0.0004 

50.00 

±0.00 

1000 

{1000} 



2 500 (10000) 

100 

3.88 

0.59 

0.1881 ±0.0037 

2.00 ±0.00 

1250 

{1250} 



100000 

(50000) 

3 

184.55 

170,12 

0,1666 

±0.0002 

50.00 

±0.00 

2000 

{2000} 


CS 

5 000 (10000) 

100 

14.59 

0.94 

0.1721 ±0.0030 

2.00 ±0.00 

2500 

{2500} 



250000 

(50000) 

3 

1324 

588,68 

0,1547 

±0.0003 

50.00 

±0.00 

5000 

{5000} 

2 

> 

II 

10000 (50000) 

10 

55.61 

11.33 

0.1644 ±0.0026 

2.00 ±0.00 

5000 

{5000} 



400000 

(50000) 

3 

3825 

1008 

0,1491 

±0.0009 

50.00 

±0.00 

8000 

{8000} 


P 

25000 (50000) 

10 

369.81 

18.29 

0.1465 ±0.0022 

2.00 ±0.00 

12500 

{12500} 















"o 

50000 (50000) 

3 

1422 

39.07 

0.1439 ±0.0017 

2.00 ±0.00 

25000 

{25000} 















=#: 

100000 (50000) 

3 

5465 

184.18 

0.1325 ±0.0003 

2.00 ±0.00 

50000 

{50000} 
















250000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

2.00 ±0.00 

125 000 

{125000} 
















400 000 (50000) 

0(3) 

NA 

NA 

NA ±NA 

2.00 ±0.00 

200000 

{200000} 















Table 5: Experimental results relating to the COD-rna data sets 



data set sizes  runs  train time  test time  test error      # of ws       ws size: median  ws size: range

LS-SVM
1000 (10000)    100   1.13        0.41       0.1582 ±0.0052  1.00
2500 (10000)    100   4.08        0.80       0.1099 ±0.0028  1.00
5000 (10000)    100   18.37       1.31       0.1010 ±0.0016  1.00
10000 (50000)   10    82.65       9.29       0.0728 ±0.0006  1.00
25000 (50000)   10    526.48      24.21      0.0542 ±0.0004  1.00
50000 (50000)   3     2146        46.89      0.0448 ±0.0003  1.00
100000 (50000)  3     8907        246.99     0.0365 ±0.0002  1.00

VP-SVM, radius = 1
1000 (10000)    100   4.22        0.58       0.2496 ±0.0074  58.07 ±1.99   10               [1, 83]
2500 (10000)    100   5.79        0.95       0.1489 ±0.0043  76.80 ±1.93   17               [1, 177]
5000 (10000)    100   7.72        1.40       0.1218 ±0.0029  93.44 ±2.24   21               [1, 336]
10000 (50000)   10    11.40       9.93       0.0834 ±0.0023  115.70 ±2.63  29               [1, 618]
25000 (50000)   10    24.17       30.30      0.0614 ±0.0014  154.10 ±3.35  43               [1, 1835]
50000 (50000)   3     86.99       201.41     0.0455 ±0.0012  162.33 ±4.04  92               [2, 2808]
100000 (50000)  3     264.22      356.52     0.0365 ±0.0019  199.00 ±7.81  127              [1, 5386]

VP-SVM, radius = 2
1000 (10000)    100   1.48        0.47       0.2038 ±0.0061  18.08 ±0.69   67               [12, 110]
2500 (10000)    100   1.92        0.89       0.1185 ±0.0034  18.93 ±0.74   169              [23, 276]
5000 (10000)    100   3.42        1.42       0.0977 ±0.0024  19.41 ±0.57   339              [16, 539]
10000 (50000)   10    7.24        14.77      0.0660 ±0.0015  20.40 ±0.52   348              [41, 1010]
25000 (50000)   10    28.04       34.35      0.0513 ±0.0010  22.20 ±0.92   664              [64, 2516]
50000 (50000)   3     196.07      235.05     0.0408 ±0.0002  23.33 ±0.58   1236             [171, 4336]
100000 (50000)  3     688.80      410.05     0.0343 ±0.0006  24.33 ±0.58   2141             [306, 9959]

VP-SVM, radius = 3
1000 (10000)    100   0.97        0.47       0.1857 ±0.0050  9.98 ±0.14    103              [84, 152]
2500 (10000)    100   1.34        0.90       0.1109 ±0.0036  10.00 ±0.00   252              [226, 276]
5000 (10000)    100   3.38        1.93       0.0966 ±0.0023  10.00 ±0.00   498              [467, 539]
10000 (50000)   10    8.67        14.17      0.0660 ±0.0008  10.00 ±0.00   1000             [962, 1059]
25000 (50000)   10    64.97       74.29      0.0507 ±0.0005  10.00 ±0.00   2494             [2447, 2566]
50000 (50000)   3     310.74      248.32     0.0409 ±0.0004  10.00 ±0.00   5002             [4902, 5147]
100000 (50000)  3     1127        421.73     0.0342 ±0.0004  10.00 ±0.00   9988             [9912, 10070]

VP-SVM, radius = 4
1000 (10000)    100   1.12        0.39       0.1582 ±0.0052  1.00 ±0.00    1000             {1000}
2500 (10000)    100   3.94        0.81       0.1099 ±0.0028  1.00 ±0.00    2500             {2500}
5000 (10000)    100   21.67       1.36       0.1010 ±0.0016  1.00 ±0.00    5000             {5000}
10000 (50000)   10    92.45       11.62      0.0728 ±0.0006  1.00 ±0.00    10000            {10000}
25000 (50000)   10    1052        110.79     0.0542 ±0.0004  1.00 ±0.00    25000            {25000}
50000 (50000)   3     3621        201.31     0.0448 ±0.0003  1.00 ±0.00    50000            {50000}
100000 (50000)  3     12439       333.57     0.0365 ±0.0002  1.00 ±0.00    100000           {100000}


data set sizes  runs  train time  test time  test error      # of ws       ws size: median  ws size: range



1000 (10000)    100   1.08        0.38       0.1582 ±0.0052  1.00 ±0.00    1000             {1000}
2500 (10000)    100   4.01        0.85       0.1099 ±0.0028  1.00 ±0.00    2500             {2500}
5000 (10000)    100   17.47       1.29       0.1010 ±0.0016  1.00 ±0.00    5000             {5000}
10000 (50000)   10    79.67       8.62       0.0728 ±0.0006  1.00 ±0.00    10000            {10000}
25000 (50000)   10    500.55      22.20      0.0542 ±0.0004  1.00 ±0.00    25000            {25000}
50000 (50000)   3     2388        116.48     0.0448 ±0.0003  1.00 ±0.00    50000            {50000}
100000 (50000)  3     8592        231.08     0.0365 ±0.0002  1.00 ±0.00    100000           {100000}



1000 (10000)    100   0.66        0.39       0.2250 ±0.0075  5.00 ±0.00    200              {200}
2500 (10000)    100   1.26        0.84       0.1698 ±0.0050  5.00 ±0.00    500              {500}
5000 (10000)    100   3.73        1.45       0.1480 ±0.0031  5.00 ±0.00    1000             {1000}
10000 (50000)   10    12.55       11.46      0.1093 ±0.0015  5.00 ±0.00    2000             {2000}
25000 (50000)   10    88.31       27.18      0.0805 ±0.0009  5.00 ±0.00    5000             {5000}
50000 (50000)   3     612.93      226.31     0.0624 ±0.0003  5.00 ±0.00    10000            {10000}
100000 (50000)  3     1623        301.96     0.0488 ±0.0001  5.00 ±0.00    20000            {20000}



1000 (10000)    100   0.94        0.36       0.2648 ±0.0059  10.00 ±0.00   100              {100}
2500 (10000)    100   1.40        0.90       0.2064 ±0.0049  10.00 ±0.00   250              {250}
5000 (10000)    100   3.20        1.39       0.1806 ±0.0039  10.00 ±0.00   500              {500}
10000 (50000)   10    6.99        10.92      0.1372 ±0.0027  10.00 ±0.00   1000             {1000}
25000 (50000)   10    40.29       29.37      0.0992 ±0.0007  10.00 ±0.00   2500             {2500}
50000 (50000)   3     199.33      168.55     0.0763 ±0.0006  10.00 ±0.00   5000             {5000}
100000 (50000)  3     892.77      499.74     0.0590 ±0.0004  10.00 ±0.00   10000            {10000}



1000 (10000)    100   1.61        0.37       0.3066 ±0.0038  20.00 ±0.00   50               {50}
2500 (10000)    100   1.95        0.86       0.2461 ±0.0039  20.00 ±0.00   125              {125}
5000 (10000)    100   2.95        1.35       0.2156 ±0.0043  20.00 ±0.00   250              {250}
10000 (50000)   10    4.67        10.56      0.1704 ±0.0031  20.00 ±0.00   500              {500}
25000 (50000)   10    22.59       29.35      0.1275 ±0.0008  20.00 ±0.00   1250             {1250}
50000 (50000)   3     92.33       168.15     0.0964 ±0.0011  20.00 ±0.00   2500             {2500}
100000 (50000)  3     1081        613.04     0.0731 ±0.0004  20.00 ±0.00   5000             {5000}



1000 (10000)    100   4.14        0.43       0.3478 ±0.0048  50.00 ±0.00   20               {20}
2500 (10000)    100   3.74        0.84       0.2966 ±0.0026  50.00 ±0.00   50               {50}
5000 (10000)    100   4.97        1.31       0.2694 ±0.0036  50.00 ±0.00   100              {100}
10000 (50000)   10    5.46        10.90      0.2220 ±0.0029  50.00 ±0.00   200              {200}
25000 (50000)   10    18.23       25.04      0.1715 ±0.0031  50.00 ±0.00   500              {500}
50000 (50000)   3     39.18       143.68     0.1366 ±0.0014  50.00 ±0.00   1000             {1000}
100000 (50000)  3     133.67      264.38     0.1036 ±0.0028  50.00 ±0.00   2000             {2000}



1000 (10000)    100   6.86        0.51       0.3502 ±0.0040  100.00 ±0.00  10               {10}
2500 (10000)    100   7.37        0.92       0.3268 ±0.0031  100.00 ±0.00  25               {25}
5000 (10000)    100   7.58        1.31       0.3123 ±0.0022  100.00 ±0.00  50               {50}
10000 (50000)   10    8.58        8.93       0.2678 ±0.0032  100.00 ±0.00  100              {100}
25000 (50000)   10    15.91       21.95      0.2077 ±0.0021  100.00 ±0.00  250              {250}
50000 (50000)   3     27.94       132.07     0.1679 ±0.0023  100.00 ±0.00  500              {500}
100000 (50000)  3     79.05       227.69     0.1341 ±0.0017  100.00 ±0.00  1000             {1000}



1000 (10000)    100   10.52       0.62       0.3468 ±0.0026  150.00 ±0.00  7                [6, 7]
2500 (10000)    100   10.43       1.00       0.3328 ±0.0024  150.00 ±0.00  17               [16, 17]
5000 (10000)    100   9.34        1.23       0.3395 ±0.0082  150.00 ±0.00  33               [33, 34]
10000 (50000)   10    11.38       9.10       0.2943 ±0.0014  150.00 ±0.00  67               [66, 67]
25000 (50000)   10    17.25       19.51      0.2336 ±0.0041  150.00 ±0.00  167              [166, 167]
50000 (50000)   3     25.66       122.27     0.1927 ±0.0005  150.00 ±0.00  333              [333, 334]
100000 (50000)  3     73.12       188.29     0.1538 ±0.0012  150.00 ±0.00  667              [666, 667]



1000 (10000)    100   13.70       0.64       0.3447 ±0.0021  200.00 ±0.00  5                {5}
2500 (10000)    100   14.45       1.11       0.3346 ±0.0020  200.00 ±0.00  12               [12, 13]
5000 (10000)    100   12.32       1.22       0.3534 ±0.0092  200.00 ±0.00  25               {25}
10000 (50000)   10    13.19       8.93       0.3091 ±0.0024  200.00 ±0.00  50               {50}
25000 (50000)   10    19.80       18.41      0.2512 ±0.0030  200.00 ±0.00  125              {125}
50000 (50000)   3     26.89       87.32      0.2095 ±0.0033  200.00 ±0.00  250              {250}
100000 (50000)  3     70.20       185.72     0.1664 ±0.0021  200.00 ±0.00  500              {500}


Table 6: Experimental results relating to the ijcnn1 data sets




data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   0.33        0.07       0.0284 ±0.0003  0.0541 ±0.0033  1.00
2500 (10000)    100   1.74        0.07       0.0275 ±0.0002  0.0461 ±0.0019  1.00
5000 (10000)    100   7.14        0.14       0.0269 ±0.0002  0.0396 ±0.0019  1.00
10000 (10000)   100   27.90       0.22       0.0265 ±0.0001  0.0323 ±0.0014  1.00

data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   1.84        0.28       0.0287 ±0.0005  0.0561 ±0.0041  15.70 ±2.43   64               [27, 109]
2500 (10000)    100   1.90        0.34       0.0267 ±0.0002  0.0368 ±0.0027  15.77 ±2.34   159              [59, 268]
5000 (10000)    100   2.01        0.40       0.0260 ±0.0001  0.0248 ±0.0018  16.04 ±2.45   312              [125, 532]
10000 (10000)   100   3.05        0.55       0.0257 ±0.0001  0.0197 ±0.0010  15.55 ±2.43   643              [245, 1054]

1000 (10000)    100   0.65        0.16       0.0277 ±0.0004  0.0465 ±0.0037  6.17 ±0.88    162              [66, 251]
2500 (10000)    100   0.75        0.18       0.0265 ±0.0002  0.0338 ±0.0024  6.12 ±0.87    408              [152, 645]
5000 (10000)    100   1.58        0.24       0.0260 ±0.0001  0.0256 ±0.0016  6.17 ±0.79    810              [305, 1291]
10000 (10000)   100   4.93        0.38       0.0258 ±0.0001  0.0213 ±0.0013  6.27 ±0.83    1595             [600, 2550]

1000 (10000)    100   0.37        0.11       0.0275 ±0.0003  0.0447 ±0.0032  3.49 ±0.50    287              [121, 502]
2500 (10000)    100   0.77        0.14       0.0266 ±0.0002  0.0355 ±0.0025  3.44 ±0.50    727              [329, 1202]
5000 (10000)    100   2.31        0.19       0.0262 ±0.0001  0.0290 ±0.0021  3.41 ±0.49    1466             [627, 2511]
10000 (10000)   100   8.88        0.31       0.0259 ±0.0001  0.0242 ±0.0019  3.51 ±0.50    2849             [1253, 5003]

1000 (10000)    100   0.27        0.08       0.0282 ±0.0003  0.0517 ±0.0030  2.00 ±0.00    500              [266, 734]
2500 (10000)    100   0.97        0.11       0.0272 ±0.0002  0.0426 ±0.0024  2.00 ±0.00    1250             [658, 1842]
5000 (10000)    100   3.68        0.15       0.0268 ±0.0002  0.0382 ±0.0021  2.00 ±0.00    2500             [1224, 3776]
10000 (10000)   100   14.93       0.44       0.0263 ±0.0001  0.0309 ±0.0020  2.00 ±0.00    5000             [2501, 7499]

1000 (10000)    100   0.32        0.07       0.0284 ±0.0004  0.0541 ±0.0033  1.00 ±0.00    1000             {1000}
2500 (10000)    100   1.75        0.08       0.0275 ±0.0002  0.0461 ±0.0020  1.00 ±0.00    2500             {2500}
5000 (10000)    100   7.16        0.15       0.0269 ±0.0002  0.0396 ±0.0019  1.00 ±0.00    5000             {5000}
10000 (10000)   100   27.91       0.24       0.0265 ±0.0001  0.0323 ±0.0014  1.00 ±0.00    10000            {10000}

data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   0.32        0.07       0.0284 ±0.0003  0.0540 ±0.0031  1.00 ±0.00    1000             {1000}
2500 (10000)    100   1.76        0.08       0.0275 ±0.0002  0.0461 ±0.0020  1.00 ±0.00    2500             {2500}
5000 (10000)    100   7.16        0.15       0.0269 ±0.0002  0.0396 ±0.0019  1.00 ±0.00    5000             {5000}
10000 (10000)   100   28.05       0.24       0.0265 ±0.0001  0.0323 ±0.0014  1.00 ±0.00    10000            {10000}

1000 (10000)    100   0.26        0.09       0.0287 ±0.0005  0.0571 ±0.0042  2.00 ±0.00    500              {500}
2500 (10000)    100   0.94        0.11       0.0276 ±0.0002  0.0475 ±0.0018  2.00 ±0.00    1250             {1250}
5000 (10000)    100   3.46        0.14       0.0273 ±0.0001  0.0444 ±0.0014  2.00 ±0.00    2500             {2500}
10000 (10000)   100   14.37       0.28       0.0268 ±0.0001  0.0376 ±0.0016  2.00 ±0.00    5000             {5000}

1000 (10000)    100   0.31        0.11       0.0288 ±0.0004  0.0583 ±0.0037  3.00 ±0.00    333              [333, 334]
2500 (10000)    100   0.76        0.14       0.0277 ±0.0002  0.0488 ±0.0022  3.00 ±0.00    833              [833, 834]
5000 (10000)    100   2.32        0.17       0.0274 ±0.0001  0.0456 ±0.0014  3.00 ±0.00    1667             [1666, 1667]
10000 (10000)   100   9.58        0.27       0.0271 ±0.0001  0.0417 ±0.0015  3.00 ±0.00    3333             [3333, 3334]

1000 (10000)    100   0.43        0.12       0.0289 ±0.0004  0.0588 ±0.0037  4.00 ±0.00    250              {250}
2500 (10000)    100   0.69        0.16       0.0278 ±0.0003  0.0500 ±0.0025  4.00 ±0.00    625              {625}
5000 (10000)    100   1.90        0.20       0.0275 ±0.0002  0.0469 ±0.0016  4.00 ±0.00    1250             {1250}
10000 (10000)   100   6.95        0.27       0.0272 ±0.0001  0.0432 ±0.0010  4.00 ±0.00    2500             {2500}

1000 (10000)    100   0.52        0.15       0.0287 ±0.0004  0.0577 ±0.0033  5.00 ±0.00    200              {200}
2500 (10000)    100   0.67        0.18       0.0280 ±0.0002  0.0515 ±0.0022  5.00 ±0.00    500              {500}
5000 (10000)    100   1.65        0.23       0.0276 ±0.0002  0.0480 ±0.0020  5.00 ±0.00    1000             {1000}
10000 (10000)   100   5.52        0.30       0.0273 ±0.0001  0.0438 ±0.0012  5.00 ±0.00    2000             {2000}

1000 (10000)    100   0.64        0.16       0.0288 ±0.0004  0.0586 ±0.0031  6.00 ±0.00    167              [166, 167]
2500 (10000)    100   0.69        0.19       0.0281 ±0.0003  0.0525 ±0.0028  6.00 ±0.00    417              [416, 417]
5000 (10000)    100   1.52        0.25       0.0277 ±0.0002  0.0490 ±0.0022  6.00 ±0.00    833              [833, 834]
10000 (10000)   100   4.67        0.32       0.0273 ±0.0001  0.0446 ±0.0011  6.00 ±0.00    1667             [1666, 1667]

1000 (10000)    100   1.05        0.22       0.0288 ±0.0003  0.0584 ±0.0029  10.00 ±0.00   100              {100}
2500 (10000)    100   1.08        0.27       0.0282 ±0.0003  0.0534 ±0.0026  10.00 ±0.00   250              {250}
5000 (10000)    100   1.39        0.34       0.0280 ±0.0002  0.0522 ±0.0023  10.00 ±0.00   500              {500}
10000 (10000)   100   3.45        0.44       0.0275 ±0.0001  0.0471 ±0.0012  10.00 ±0.00   1000             {1000}

1000 (10000)    100   2.28        0.34       0.0293 ±0.0004  0.0626 ±0.0027  20.00 ±0.00   50               {50}
2500 (10000)    100   2.06        0.43       0.0282 ±0.0002  0.0531 ±0.0022  20.00 ±0.00   125              {125}
5000 (10000)    100   2.38        0.51       0.0283 ±0.0002  0.0544 ±0.0017  20.00 ±0.00   250              {250}
10000 (10000)   100   3.14        0.64       0.0280 ±0.0001  0.0520 ±0.0014  20.00 ±0.00   500              {500}

Table 7: Experimental results relating to the artificial data of Type I


data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   0.60        0.12       0.0178 ±0.0006  0.0641 ±0.0047  1.00
2500 (10000)    100   2.29        0.21       0.0168 ±0.0004  0.0559 ±0.0033  1.00
5000 (10000)    100   8.37        0.25       0.0165 ±0.0003  0.0531 ±0.0026  1.00
10000 (10000)   100   31.61       0.44       0.0163 ±0.0003  0.0511 ±0.0033  1.00

data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   1.55        0.29       0.0157 ±0.0004  0.0427 ±0.0040  16.08 ±2.50   62               [22, 114]
2500 (10000)    100   1.66        0.36       0.0146 ±0.0002  0.0288 ±0.0028  15.79 ±2.40   158              [51, 292]
5000 (10000)    100   1.88        0.43       0.0142 ±0.0001  0.0208 ±0.0016  15.95 ±2.53   313              [105, 549]
10000 (10000)   100   3.09        0.59       0.0140 ±0.0001  0.0163 ±0.0013  15.84 ±2.50   631              [231, 1043]

1000 (10000)    100   0.58        0.17       0.0156 ±0.0004  0.0426 ±0.0045  6.21 ±0.83    161              [60, 267]
2500 (10000)    100   0.77        0.22       0.0147 ±0.0002  0.0299 ±0.0030  6.29 ±0.81    397              [133, 676]
5000 (10000)    100   1.64        0.30       0.0144 ±0.0001  0.0247 ±0.0023  6.22 ±0.82    804              [285, 1299]
10000 (10000)   100   4.99        0.44       0.0141 ±0.0001  0.0194 ±0.0016  6.38 ±0.80    1567             [597, 2535]

1000 (10000)    100   0.36        0.12       0.0159 ±0.0004  0.0465 ±0.0050  3.47 ±0.50    288              [122, 482]
2500 (10000)    100   0.81        0.18       0.0151 ±0.0002  0.0370 ±0.0031  3.52 ±0.50    710              [297, 1248]
5000 (10000)    100   2.52        0.28       0.0150 ±0.0002  0.0346 ±0.0032  3.47 ±0.50    1441             [604, 2508]
10000 (10000)   100   9.36        0.43       0.0147 ±0.0002  0.0301 ±0.0034  3.58 ±0.50    2793             [1249, 4962]

1000 (10000)    100   0.33        0.09       0.0169 ±0.0005  0.0562 ±0.0047  2.00 ±0.00    500              [265, 735]
2500 (10000)    100   1.11        0.14       0.0160 ±0.0004  0.0477 ±0.0041  2.00 ±0.00    1250             [613, 1887]
5000 (10000)    100   4.11        0.23       0.0158 ±0.0004  0.0452 ±0.0044  2.00 ±0.00    2500             [1232, 3768]
10000 (10000)   100   16.26       0.42       0.0154 ±0.0005  0.0402 ±0.0064  2.00 ±0.00    5000             [2533, 7467]

1000 (10000)    100   0.37        0.08       0.0178 ±0.0006  0.0641 ±0.0045  1.00 ±0.00    1000             {1000}
2500 (10000)    100   2.03        0.12       0.0168 ±0.0003  0.0558 ±0.0033  1.00 ±0.00    2500             {2500}
5000 (10000)    100   8.12        0.20       0.0165 ±0.0003  0.0531 ±0.0026  1.00 ±0.00    5000             {5000}
10000 (10000)   100   31.12       0.36       0.0163 ±0.0003  0.0511 ±0.0033  1.00 ±0.00    10000            {10000}

data set sizes  runs  train time  test time  test error      L2-error        # of ws       ws size: median  ws size: range
1000 (10000)    100   0.37        0.07       0.0178 ±0.0006  0.0640 ±0.0046  1.00 ±0.00    1000             {1000}
2500 (10000)    100   2.02        0.13       0.0168 ±0.0003  0.0558 ±0.0033  1.00 ±0.00    2500             {2500}
5000 (10000)    100   8.14        0.19       0.0165 ±0.0003  0.0531 ±0.0026  1.00 ±0.00    5000             {5000}
10000 (10000)   100   31.45       0.36       0.0163 ±0.0003  0.0511 ±0.0033  1.00 ±0.00    10000            {10000}

1000 (10000)    100   0.30        0.11       0.0172 ±0.0005  0.0595 ±0.0046  2.00 ±0.00    500              {500}
2500 (10000)    100   1.10        0.15       0.0169 ±0.0004  0.0567 ±0.0037  2.00 ±0.00    1250             {1250}
5000 (10000)    100   4.06        0.25       0.0165 ±0.0003  0.0528 ±0.0029  2.00 ±0.00    2500             {2500}
10000 (10000)   100   16.33       0.41       0.0163 ±0.0002  0.0516 ±0.0020  2.00 ±0.00    5000             {5000}

1000 (10000)    100   0.32        0.14       0.0168 ±0.0004  0.0555 ±0.0040  3.00 ±0.00    333              [333, 334]
2500 (10000)    100   0.88        0.17       0.0169 ±0.0004  0.0566 ±0.0033  3.00 ±0.00    833              [833, 834]
5000 (10000)    100   2.71        0.28       0.0165 ±0.0002  0.0534 ±0.0022  3.00 ±0.00    1667             [1666, 1667]
10000 (10000)   100   10.99       0.45       0.0163 ±0.0002  0.0515 ±0.0018  3.00 ±0.00    3333             [3333, 3334]

1000 (10000)    100   0.39        0.17       0.0166 ±0.0004  0.0544 ±0.0039  4.00 ±0.00    250              {250}
2500 (10000)    100   0.81        0.21       0.0166 ±0.0003  0.0537 ±0.0032  4.00 ±0.00    625              {625}
5000 (10000)    100   2.23        0.31       0.0165 ±0.0003  0.0532 ±0.0027  4.00 ±0.00    1250             {1250}
10000 (10000)   100   8.16        0.49       0.0163 ±0.0002  0.0516 ±0.0020  4.00 ±0.00    2500             {2500}

1000 (10000)    100   0.46        0.19       0.0167 ±0.0004  0.0553 ±0.0041  5.00 ±0.00    200              {200}
2500 (10000)    100   0.78        0.25       0.0165 ±0.0004  0.0532 ±0.0035  5.00 ±0.00    500              {500}
5000 (10000)    100   1.94        0.32       0.0164 ±0.0003  0.0524 ±0.0029  5.00 ±0.00    1000             {1000}
10000 (10000)   100   6.51        0.51       0.0164 ±0.0002  0.0521 ±0.0022  5.00 ±0.00    2000             {2000}

1000 (10000)    100   0.53        0.21       0.0168 ±0.0004  0.0556 ±0.0031  6.00 ±0.00    167              [166, 167]
2500 (10000)    100   0.76        0.28       0.0164 ±0.0003  0.0520 ±0.0029  6.00 ±0.00    417              [416, 417]
5000 (10000)    100   1.77        0.37       0.0164 ±0.0003  0.0518 ±0.0027  6.00 ±0.00    833              [833, 834]
10000 (10000)   100   5.49        0.54       0.0165 ±0.0002  0.0530 ±0.0019  6.00 ±0.00    1667             [1666, 1667]

1000 (10000)    100   0.85        0.27       0.0171 ±0.0003  0.0585 ±0.0027  10.00 ±0.00   100              {100}
2500 (10000)    100   0.98        0.40       0.0160 ±0.0003  0.0487 ±0.0027  10.00 ±0.00   250              {250}
5000 (10000)    100   1.56        0.52       0.0160 ±0.0002  0.0478 ±0.0023  10.00 ±0.00   500              {500}
10000 (10000)   100   3.91        0.63       0.0164 ±0.0002  0.0524 ±0.0017  10.00 ±0.00   1000             {1000}

1000 (10000)    100   1.83        0.37       0.0196 ±0.0007  0.0774 ±0.0042  20.00 ±0.00   50               {50}
2500 (10000)    100   1.97        0.56       0.0163 ±0.0002  0.0512 ±0.0020  20.00 ±0.00   125              {125}
5000 (10000)    100   2.20        0.79       0.0156 ±0.0002  0.0446 ±0.0018  20.00 ±0.00   250              {250}
10000 (10000)   100   3.19        0.99       0.0160 ±0.0002  0.0488 ±0.0018  20.00 ±0.00   500              {500}

Table 8: Experimental results relating to the artificial data of Type II



_ 1 

data set sizes 

runs 

train time 

test time 

test error 

L2-error 

# of ws 

ws size: median 

ws size: range 



data set sizes 

runs 

train time 

test time 

test error 

L2-error 

# of ws 

ws size: median 

ws size: range 



1000 (10000) 

100 

0.36 

0.05 

0.0588 ±0.0009 

0.0799 ±0.0056 

1.00 





1000 (10000) 

100 

0.61 

0.07 

0.0155 ±0.0002 

0.0848 ±0.0012 

1.00 



> 


2 500 (10000) 

100 

1.82 

0.07 

0.0572 ±0.0005 

0.0677 ±0.0039 

1.00 



g 


2 500 (10000) 

100 

3.28 

0.16 

0.0151 ±0.0003 

0.0821 ±0.0016 

1.00 



CO 


5 000 (10000) 

100 

7.47 

0.11 

0.0580 ±0.0003 

0.0745 ±0.0019 

1.00 





5 000 (10000) 

100 

11.89 

0.63 

0.0132 ±0.0007 

0.0694 ±0.0049 

1.00 





10 000 (10000) 

100 

29.56 

0.24 

0.0559 ±0.0003 

0.0563 ±0.0028 

1.00 





10000 (10000) 

100 

45.94 

0.58 

0.0137 ±0.0007 

0.0730 ±0.0046 

1.00 





data set sizes 

r„« 

train time 

test time 

test error 

L2-error 

# of ws 

ws size: median 

ws size: range 



data set sizes 

rum 

train time 

test time 

test error 

L2-error 

# of ws 

ws size: median 

ws size: range 



1000 (10000) 

100 

1.52 

0.29 

0.0580 ±0.0009 

0.0731 ±0.0057 

15.55 ±2.36 

64 

[26 , 109] 

s 

in 

1000 (10000) 

100 

4.27 

0.62 

0.0172 ±0.0005 

0.0946 ±0.0024 

44.23 ±2.19 

23 

[4 , 54] 

> 

o 

2 500 (10000) 

100 

1.61 

0.35 

0.0553 ±0.0004 

0.0493 ±0.0038 

15.99 ±2.39 

156 

[56 , 279] 

> 

d 

2 500 (10000) 

100 

4.78 

0.85 

0.0142 ±0.0002 

0.0775 ±0.0015 

46.09 ±2.58 

54 

[13 , 108] 

pi 


5 000 (10000) 

100 

1.61 

0.42 

0.0539 ±0.0002 

0.0347 ±0.0024 

15.80 ±2.45 

316 

[116 , 53li 

pi 

II 

5 000 (10000) 

100 

5.03 

1.16 

0.0129 ±0.0002 

0.0682 ±0.0010 

46.32 ±2.61 

108 

[19 , 244] 



10 000 (10000) 

100 

3.14 

0.58 

0.0534 ±0.0001 

0.0257 ±0.0016 

15.73 ±2.58 

636 

[236 , 1068] 



10000 (10000) 

100 

5.72 

1.76 

0.0119 ±0.0001 

0.0607 ±0.0007 

47.52 ±2.72 

210 

[43,416] 

1000 (10000) | 100 | 0.53 | 0.16 | 0.0563 ±0.0007 | 0.0600 ±0.0057 | 6.18 ±0.87 | 162 | [56, 259]
1000 (10000) | 100 | 1.45 | 0.30 | 0.0158 ±0.0003 | 0.0867 ±0.0019 | 14.50 ±1.25 | 69 | [18, 155]
2500 (10000) | 100 | 0.77 | 0.19 | 0.0547 ±0.0003 | 0.0429 ±0.0034 | 6.26 ±0.81 | 399 | [150, 658]
2500 (10000) | 100 | 1.62 | 0.49 | 0.0139 ±0.0002 | 0.0752 ±0.0013 | 15.22 ±1.47 | 164 | [40, 323]
5000 (10000) | 100 | 1.65 | 0.26 | 0.0537 ±0.0001 | 0.0315 ±0.0024 | 6.27 ±0.81 | 797 | [321, 1270]
5000 (10000) | 100 | 2.17 | 0.78 | 0.0129 ±0.0002 | 0.0683 ±0.0011 | 15.00 ±1.45 | 333 | [78, 635]
10000 (10000) | 100 | 5.11 | 0.39 | 0.0534 ±0.0001 | 0.0244 ±0.0021 | 6.34 ±0.79 | 1577 | [597, 2534]
10000 (10000) | 100 | 4.52 | 1.24 | 0.0120 ±0.0001 | 0.0611 ±0.0009 | 15.72 ±1.62 | 636 | [144, 1261]

1000 (10000) | 100 | 0.34 | 0.11 | 0.0564 ±0.0007 | 0.0620 ±0.0057 | 3.48 ±0.50 | 287 | [126, 501]
1000 (10000) | 100 | 0.49 | 0.18 | 0.0152 ±0.0003 | 0.0833 ±0.0017 | 5.16 ±0.53 | 194 | [69, 478]
2500 (10000) | 100 | 0.81 | 0.14 | 0.0551 ±0.0004 | 0.0478 ±0.0047 | 3.49 ±0.50 | 716 | [297, 1244]
2500 (10000) | 100 | 0.99 | 0.32 | 0.0140 ±0.0003 | 0.0753 ±0.0019 | 5.40 ±0.51 | 463 | [156, 1260]
5000 (10000) | 100 | 2.44 | 0.22 | 0.0543 ±0.0004 | 0.0409 ±0.0048 | 3.50 ±0.50 | 1429 | [641, 2517]
5000 (10000) | 100 | 2.77 | 0.58 | 0.0132 ±0.0003 | 0.0697 ±0.0020 | 5.48 ±0.54 | 912 | [264, 2437]
10000 (10000) | 100 | 9.40 | 0.30 | 0.0538 ±0.0004 | 0.0322 ±0.0054 | 3.54 ±0.50 | 2825 | [1209, 4993]
10000 (10000) | 100 | 10.03 | 0.80 | 0.0124 ±0.0002 | 0.0638 ±0.0013 | 5.56 ±0.64 | 1799 | [529, 4868]



1000 (10000) | 100 | 0.28 | 0.08 | 0.0579 ±0.0009 | 0.0738 ±0.0062 | 2.00 ±0.00 | 500 | [248, 752]
1000 (10000) | 100 | 0.43 | 0.10 | 0.0155 ±0.0002 | 0.0847 ±0.0014 | 1.72 ±0.45 | 581 | [283, 1000]
2500 (10000) | 100 | 1.04 | 0.11 | 0.0565 ±0.0007 | 0.0617 ±0.0059 | 2.00 ±0.00 | 1250 | [625, 1875]
2500 (10000) | 100 | 2.29 | 0.15 | 0.0149 ±0.0004 | 0.0808 ±0.0024 | 1.55 ±0.50 | 1613 | [696, 2500]
5000 (10000) | 100 | 3.92 | 0.15 | 0.0564 ±0.0005 | 0.0622 ±0.0044 | 2.00 ±0.00 | 2500 | [1281, 3719]
5000 (10000) | 100 | 7.96 | 0.42 | 0.0138 ±0.0007 | 0.0739 ±0.0049 | 1.67 ±0.47 | 2994 | [1345, 5000]
10000 (10000) | 100 | 15.93 | 0.23 | 0.0555 ±0.0004 | 0.0520 ±0.0035 | 2.00 ±0.00 | 5000 | [2404, 7596]
10000 (10000) | 100 | 31.92 | 0.82 | 0.0126 ±0.0009 | 0.0656 ±0.0065 | 1.64 ±0.48 | 6098 | [2702, 10000]



1000 (10000) | 100 | 0.35 | 0.06 | 0.0588 ±0.0009 | 0.0799 ±0.0057 | 1.00 ±0.00 | 1000 | {1000}
1000 (10000) | 100 | 0.54 | 0.07 | 0.0155 ±0.0002 | 0.0847 ±0.0012 | 1.00 ±0.00 | 1000 | {1000}
2500 (10000) | 100 | 1.84 | 0.08 | 0.0572 ±0.0005 | 0.0677 ±0.0039 | 1.00 ±0.00 | 2500 | {2500}
2500 (10000) | 100 | 3.09 | 0.12 | 0.0151 ±0.0003 | 0.0822 ±0.0016 | 1.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 7.51 | 0.12 | 0.0580 ±0.0003 | 0.0745 ±0.0019 | 1.00 ±0.00 | 5000 | {5000}
5000 (10000) | 100 | 11.64 | 0.53 | 0.0132 ±0.0007 | 0.0694 ±0.0049 | 1.00 ±0.00 | 5000 | {5000}
10000 (10000) | 100 | 29.75 | 0.25 | 0.0559 ±0.0003 | 0.0563 ±0.0028 | 1.00 ±0.00 | 10000 | {10000}
10000 (10000) | 100 | 45.40 | 0.54 | 0.0137 ±0.0007 | 0.0730 ±0.0046 | 1.00 ±0.00 | 10000 | {10000}



data set sizes | runs | train time | test time | test error | L2-error | # of ws | ws size: median | ws size: range



1000 (10000) | 100 | 0.35 | 0.07 | 0.0588 ±0.0009 | 0.0801 ±0.0058 | 1.00 ±0.00 | 1000 | {1000}
1000 (10000) | 100 | 0.53 | 0.06 | 0.0155 ±0.0002 | 0.0847 ±0.0012 | 1.00 ±0.00 | 1000 | {1000}
2500 (10000) | 100 | 1.85 | 0.08 | 0.0572 ±0.0005 | 0.0678 ±0.0039 | 1.00 ±0.00 | 2500 | {2500}
2500 (10000) | 100 | 3.08 | 0.12 | 0.0151 ±0.0003 | 0.0822 ±0.0016 | 1.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 7.73 | 0.12 | 0.0580 ±0.0003 | 0.0745 ±0.0019 | 1.00 ±0.00 | 5000 | {5000}
5000 (10000) | 100 | 11.67 | 0.52 | 0.0132 ±0.0007 | 0.0694 ±0.0049 | 1.00 ±0.00 | 5000 | {5000}
10000 (10000) | 100 | 30.24 | 0.27 | 0.0559 ±0.0003 | 0.0563 ±0.0028 | 1.00 ±0.00 | 10000 | {10000}
10000 (10000) | 100 | 46.22 | 0.55 | 0.0137 ±0.0007 | 0.0730 ±0.0046 | 1.00 ±0.00 | 10000 | {10000}



1000 (10000) | 100 | 0.30 | 0.08 | 0.0596 ±0.0009 | 0.0855 ±0.0052 | 2.00 ±0.00 | 500 | {500}
1000 (10000) | 100 | 0.37 | 0.14 | 0.0153 ±0.0002 | 0.0837 ±0.0015 | 2.00 ±0.00 | 500 | {500}
2500 (10000) | 100 | 1.01 | 0.11 | 0.0579 ±0.0005 | 0.0736 ±0.0033 | 2.00 ±0.00 | 1250 | {1250}
2500 (10000) | 100 | 1.61 | 0.13 | 0.0152 ±0.0002 | 0.0830 ±0.0012 | 2.00 ±0.00 | 1250 | {1250}
5000 (10000) | 100 | 3.81 | 0.15 | 0.0573 ±0.0005 | 0.0699 ±0.0038 | 2.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 6.12 | 0.22 | 0.0151 ±0.0002 | 0.0825 ±0.0014 | 2.00 ±0.00 | 2500 | {2500}
10000 (10000) | 100 | 15.22 | 0.24 | 0.0562 ±0.0004 | 0.0598 ±0.0035 | 2.00 ±0.00 | 5000 | {5000}
10000 (10000) | 100 | 23.51 | 0.97 | 0.0130 ±0.0006 | 0.0680 ±0.0046 | 2.00 ±0.00 | 5000 | {5000}

1000 (10000) | 100 | 0.32 | 0.12 | 0.0591 ±0.0010 | 0.0825 ±0.0067 | 3.00 ±0.00 | 333 | [333, 334]
1000 (10000) | 100 | 0.46 | 0.22 | 0.0152 ±0.0002 | 0.0830 ±0.0014 | 5.00 ±0.00 | 200 | {200}
2500 (10000) | 100 | 0.82 | 0.14 | 0.0585 ±0.0006 | 0.0776 ±0.0038 | 3.00 ±0.00 | 833 | [833, 834]
2500 (10000) | 100 | 0.96 | 0.30 | 0.0150 ±0.0002 | 0.0815 ±0.0009 | 5.00 ±0.00 | 500 | {500}
5000 (10000) | 100 | 2.52 | 0.18 | 0.0574 ±0.0004 | 0.0710 ±0.0031 | 3.00 ±0.00 | 1667 | [1666, 1667]
5000 (10000) | 100 | 2.70 | 0.28 | 0.0153 ±0.0001 | 0.0833 ±0.0008 | 5.00 ±0.00 | 1000 | {1000}
10000 (10000) | 100 | 10.05 | 0.24 | 0.0564 ±0.0004 | 0.0622 ±0.0032 | 3.00 ±0.00 | 3333 | [3333, 3334]
10000 (10000) | 100 | 9.95 | 0.36 | 0.0152 ±0.0001 | 0.0826 ±0.0007 | 5.00 ±0.00 | 2000 | {2000}



1000 (10000) | 100 | 0.37 | 0.15 | 0.0589 ±0.0011 | 0.0814 ±0.0069 | 4.00 ±0.00 | 250 | {250}
1000 (10000) | 100 | 0.88 | 0.29 | 0.0154 ±0.0002 | 0.0844 ±0.0011 | 10.00 ±0.00 | 100 | {100}
2500 (10000) | 100 | 0.77 | 0.16 | 0.0587 ±0.0006 | 0.0793 ±0.0038 | 4.00 ±0.00 | 625 | {625}
2500 (10000) | 100 | 1.02 | 0.46 | 0.0148 ±0.0001 | 0.0808 ±0.0007 | 10.00 ±0.00 | 250 | {250}
5000 (10000) | 100 | 2.07 | 0.20 | 0.0579 ±0.0004 | 0.0747 ±0.0025 | 4.00 ±0.00 | 1250 | {1250}
5000 (10000) | 100 | 1.96 | 0.58 | 0.0150 ±0.0001 | 0.0818 ±0.0008 | 10.00 ±0.00 | 500 | {500}
10000 (10000) | 100 | 7.41 | 0.27 | 0.0563 ±0.0004 | 0.0616 ±0.0036 | 4.00 ±0.00 | 2500 | {2500}
10000 (10000) | 100 | 5.67 | 0.53 | 0.0152 ±0.0001 | 0.0827 ±0.0005 | 10.00 ±0.00 | 1000 | {1000}

1000 (10000) | 100 | 0.46 | 0.17 | 0.0588 ±0.0010 | 0.0811 ±0.0064 | 5.00 ±0.00 | 200 | {200}
1000 (10000) | 100 | 1.50 | 0.34 | 0.0157 ±0.0002 | 0.0864 ±0.0014 | 15.00 ±0.00 | 67 | [66, 67]
2500 (10000) | 100 | 0.73 | 0.19 | 0.0586 ±0.0007 | 0.0787 ±0.0045 | 5.00 ±0.00 | 500 | {500}
2500 (10000) | 100 | 1.62 | 0.53 | 0.0149 ±0.0001 | 0.0814 ±0.0007 | 15.00 ±0.00 | 167 | [166, 167]
5000 (10000) | 100 | 1.87 | 0.24 | 0.0583 ±0.0004 | 0.0775 ±0.0027 | 5.00 ±0.00 | 1000 | {1000}
5000 (10000) | 100 | 2.10 | 0.84 | 0.0149 ±0.0001 | 0.0812 ±0.0006 | 15.00 ±0.00 | 333 | [333, 334]
10000 (10000) | 100 | 5.88 | 0.30 | 0.0563 ±0.0003 | 0.0617 ±0.0029 | 5.00 ±0.00 | 2000 | {2000}
10000 (10000) | 100 | 4.80 | 0.87 | 0.0150 ±0.0001 | 0.0819 ±0.0005 | 15.00 ±0.00 | 667 | [666, 667]



1000 (10000) | 100 | 0.52 | 0.19 | 0.0586 ±0.0010 | 0.0796 ±0.0070 | 6.00 ±0.00 | 167 | [166, 167]
1000 (10000) | 100 | 1.88 | 0.39 | 0.0162 ±0.0003 | 0.0891 ±0.0017 | 20.00 ±0.00 | 50 | {50}
2500 (10000) | 100 | 0.73 | 0.21 | 0.0586 ±0.0006 | 0.0788 ±0.0044 | 6.00 ±0.00 | 417 | [416, 417]
2500 (10000) | 100 | 1.99 | 0.59 | 0.0151 ±0.0001 | 0.0822 ±0.0006 | 20.00 ±0.00 | 125 | {125}
5000 (10000) | 100 | 1.67 | 0.26 | 0.0586 ±0.0005 | 0.0793 ±0.0033 | 6.00 ±0.00 | 833 | [833, 834]
5000 (10000) | 100 | 2.27 | 0.89 | 0.0149 ±0.0001 | 0.0814 ±0.0006 | 20.00 ±0.00 | 250 | {250}
10000 (10000) | 100 | 5.01 | 0.33 | 0.0565 ±0.0003 | 0.0637 ±0.0026 | 6.00 ±0.00 | 1667 | [1666, 1667]
10000 (10000) | 100 | 4.24 | 1.14 | 0.0149 ±0.0001 | 0.0812 ±0.0005 | 20.00 ±0.00 | 500 | {500}


1000 (10000) | 100 | 0.89 | 0.25 | 0.0593 ±0.0011 | 0.0845 ±0.0069 | 10.00 ±0.00 | 100 | {100}
1000 (10000) | 100 | 3.96 | 0.59 | 0.0194 ±0.0007 | 0.1053 ±0.0033 | 40.00 ±0.00 | 25 | {25}
2500 (10000) | 100 | 0.99 | 0.33 | 0.0579 ±0.0006 | 0.0744 ±0.0043 | 10.00 ±0.00 | 250 | {250}
2500 (10000) | 100 | 4.06 | 0.82 | 0.0157 ±0.0002 | 0.0859 ±0.0010 | 40.00 ±0.00 | 62 | [62, 63]
5000 (10000) | 100 | 1.51 | 0.36 | 0.0590 ±0.0005 | 0.0824 ±0.0033 | 10.00 ±0.00 | 500 | {500}
5000 (10000) | 100 | 4.29 | 1.17 | 0.0152 ±0.0001 | 0.0829 ±0.0005 | 40.00 ±0.00 | 125 | {125}
10000 (10000) | 100 | 3.59 | 0.44 | 0.0574 ±0.0003 | 0.0705 ±0.0024 | 10.00 ±0.00 | 1000 | {1000}
10000 (10000) | 100 | 4.93 | 1.75 | 0.0148 ±0.0001 | 0.0808 ±0.0004 | 40.00 ±0.00 | 250 | {250}



1000 (10000) | 100 | 1.90 | 0.38 | 0.0671 ±0.0019 | 0.1238 ±0.0082 | 20.00 ±0.00 | 50 | {50}
1000 (10000) | 100 | 4.96 | 0.69 | 0.0222 ±0.0011 | 0.1181 ±0.0045 | 50.00 ±0.00 | 20 | {20}
2500 (10000) | 100 | 2.01 | 0.53 | 0.0578 ±0.0007 | 0.0742 ±0.0052 | 20.00 ±0.00 | 125 | {125}
2500 (10000) | 100 | 5.09 | 0.93 | 0.0161 ±0.0002 | 0.0883 ±0.0011 | 50.00 ±0.00 | 50 | {50}
5000 (10000) | 100 | 2.19 | 0.64 | 0.0583 ±0.0005 | 0.0779 ±0.0031 | 20.00 ±0.00 | 250 | {250}
5000 (10000) | 100 | 5.30 | 1.29 | 0.0153 ±0.0001 | 0.0839 ±0.0005 | 50.00 ±0.00 | 100 | {100}
10000 (10000) | 100 | 3.20 | 0.67 | 0.0582 ±0.0003 | 0.0770 ±0.0021 | 20.00 ±0.00 | 500 | {500}
10000 (10000) | 100 | 6.43 | 1.94 | 0.0149 ±0.0001 | 0.0810 ±0.0004 | 50.00 ±0.00 | 200 | {200}


Table 9: Experimental results relating to the artificial data of Type III

Table 10: Experimental results relating to the artificial data of Type IV





data set sizes | runs | train time | test time | test error | L2-error | # of ws | ws size: median | ws size: range


1000 (10000) | 100 | 0.53 | 0.05 | 0.0649 ±0.0004 | 0.0370 ±0.0055 | 1.00
2500 (10000) | 100 | 3.26 | 0.09 | 0.0647 ±0.0002 | 0.0330 ±0.0032 | 1.00
5000 (10000) | 100 | 12.51 | 0.15 | 0.0640 ±0.0001 | 0.0240 ±0.0015 | 1.00
10000 (10000) | 100 | 49.29 | 0.28 | 0.0640 ±0.0001 | 0.0223 ±0.0015 | 1.00




data set sizes | runs | train time | test time | test error | L2-error | # of ws | ws size: median | ws size: range

1000 (10000) | 100 | 4.85 | 0.66 | 0.0797 ±0.0021 | 0.1258 ±0.0076 | 44.56 ±2.07 | 22 | [6, 54]
2500 (10000) | 100 | 4.76 | 0.77 | 0.0719 ±0.0009 | 0.0882 ±0.0043 | 46.18 ±2.55 | 54 | [14, 116]
5000 (10000) | 100 | 4.88 | 0.95 | 0.0682 ±0.0005 | 0.0663 ±0.0033 | 47.45 ±2.61 | 105 | [20, 214]
10000 (10000) | 100 | 5.77 | 1.14 | 0.0657 ±0.0003 | 0.0460 ±0.0027 | 48.05 ±2.78 | 208 | [38, 407]

1000 (10000) | 100 | 1.65 | 0.30 | 0.0703 ±0.0011 | 0.0820 ±0.0068 | 14.98 ±1.44 | 67 | [16, 129]
2500 (10000) | 100 | 1.64 | 0.37 | 0.0670 ±0.0006 | 0.0565 ±0.0045 | 15.17 ±1.52 | 165 | [40, 334]
5000 (10000) | 100 | 2.26 | 0.45 | 0.0652 ±0.0003 | 0.0404 ±0.0035 | 15.69 ±1.47 | 319 | [74, 644]
10000 (10000) | 100 | 4.81 | 0.62 | 0.0645 ±0.0002 | 0.0307 ±0.0023 | 15.83 ±1.49 | 632 | [167, 1343]

1000 (10000) | 100 | 0.56 | 0.16 | 0.0670 ±0.0008 | 0.0578 ±0.0064 | 5.33 ±0.47 | 188 | [68, 464]
2500 (10000) | 100 | 1.05 | 0.19 | 0.0656 ±0.0004 | 0.0431 ±0.0042 | 5.33 ±0.47 | 469 | [171, 1162]
5000 (10000) | 100 | 3.01 | 0.26 | 0.0643 ±0.0002 | 0.0290 ±0.0025 | 5.42 ±0.52 | 923 | [297, 2445]
10000 (10000) | 100 | 10.81 | 0.41 | 0.0641 ±0.0001 | 0.0243 ±0.0022 | 5.54 ±0.61 | 1805 | [555, 4840]

1000 (10000) | 100 | 0.49 | 0.13 | 0.0653 ±0.0006 | 0.0427 ±0.0061 | 1.65 ±0.48 | 606 | [266, 1000]
2500 (10000) | 100 | 2.21 | 0.11 | 0.0649 ±0.0003 | 0.0357 ±0.0042 | 1.70 ±0.48 | 1471 | [603, 2500]
5000 (10000) | 100 | 9.32 | 0.18 | 0.0641 ±0.0001 | 0.0248 ±0.0018 | 1.54 ±0.50 | 3247 | [1348, 5000]
10000 (10000) | 100 | 35.16 | 0.31 | 0.0640 ±0.0001 | 0.0227 ±0.0018 | 1.62 ±0.49 | 6173 | [2608, 10000]

1000 (10000) | 100 | 0.54 | 0.07 | 0.0649 ±0.0004 | 0.0371 ±0.0055 | 1.00 ±0.00 | 1000 | {1000}
2500 (10000) | 100 | 3.25 | 0.09 | 0.0647 ±0.0002 | 0.0331 ±0.0032 | 1.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 12.54 | 0.16 | 0.0640 ±0.0001 | 0.0240 ±0.0015 | 1.00 ±0.00 | 5000 | {5000}
10000 (10000) | 100 | 49.35 | 0.29 | 0.0640 ±0.0001 | 0.0223 ±0.0015 | 1.00 ±0.00 | 10000 | {10000}


data set sizes | runs | train time | test time | test error | L2-error | # of ws | ws size: median | ws size: range


1000 (10000) | 100 | 0.53 | 0.07 | 0.0649 ±0.0004 | 0.0373 ±0.0056 | 1.00 ±0.00 | 1000 | {1000}
2500 (10000) | 100 | 3.27 | 0.09 | 0.0647 ±0.0002 | 0.0331 ±0.0033 | 1.00 ±0.00 | 2500 | {2500}
5000 (10000) | 100 | 12.55 | 0.16 | 0.0640 ±0.0001 | 0.0240 ±0.0015 | 1.00 ±0.00 | 5000 | {5000}
10000 (10000) | 100 | 50.04 | 0.29 | 0.0640 ±0.0001 | 0.0223 ±0.0015 | 1.00 ±0.00 | 10000 | {10000}



1000 (10000) | 100 | 0.38 | 0.09 | 0.0652 ±0.0005 | 0.0409 ±0.0056 | 2.00 ±0.00 | 500 | {500}
2500 (10000) | 100 | 1.71 | 0.12 | 0.0649 ±0.0003 | 0.0352 ±0.0037 | 2.00 ±0.00 | 1250 | {1250}
5000 (10000) | 100 | 6.39 | 0.18 | 0.0640 ±0.0001 | 0.0242 ±0.0019 | 2.00 ±0.00 | 2500 | {2500}
10000 (10000) | 100 | 25.47 | 0.30 | 0.0640 ±0.0001 | 0.0221 ±0.0014 | 2.00 ±0.00 | 5000 | {5000}

1000 (10000) | 100 | 0.48 | 0.17 | 0.0655 ±0.0005 | 0.0440 ±0.0056 | 5.00 ±0.00 | 200 | {200}
2500 (10000) | 100 | 1.02 | 0.19 | 0.0654 ±0.0004 | 0.0426 ±0.0044 | 5.00 ±0.00 | 500 | {500}
5000 (10000) | 100 | 2.82 | 0.24 | 0.0643 ±0.0002 | 0.0295 ±0.0026 | 5.00 ±0.00 | 1000 | {1000}
10000 (10000) | 100 | 10.63 | 0.37 | 0.0641 ±0.0001 | 0.0248 ±0.0017 | 5.00 ±0.00 | 2000 | {2000}


1000 (10000) | 100 | 0.85 | 0.25 | 0.0657 ±0.0005 | 0.0471 ±0.0054 | 10.00 ±0.00 | 100 | {100}
2500 (10000) | 100 | 1.05 | 0.33 | 0.0656 ±0.0004 | 0.0448 ±0.0041 | 10.00 ±0.00 | 250 | {250}
5000 (10000) | 100 | 2.04 | 0.35 | 0.0646 ±0.0002 | 0.0335 ±0.0032 | 10.00 ±0.00 | 500 | {500}
10000 (10000) | 100 | 6.34 | 0.51 | 0.0644 ±0.0001 | 0.0299 ±0.0019 | 10.00 ±0.00 | 1000 | {1000}

1000 (10000) | 100 | 1.47 | 0.32 | 0.0664 ±0.0006 | 0.0537 ±0.0059 | 15.00 ±0.00 | 67 | [66, 67]
2500 (10000) | 100 | 1.63 | 0.43 | 0.0656 ±0.0004 | 0.0449 ±0.0040 | 15.00 ±0.00 | 167 | [166, 167]
5000 (10000) | 100 | 2.18 | 0.49 | 0.0647 ±0.0002 | 0.0350 ±0.0030 | 15.00 ±0.00 | 333 | [333, 334]
10000 (10000) | 100 | 5.23 | 0.58 | 0.0646 ±0.0001 | 0.0335 ±0.0022 | 15.00 ±0.00 | 667 | [666, 667]



1000 (10000) | 100 | 1.92 | 0.38 | 0.0671 ±0.0007 | 0.0601 ±0.0054 | 20.00 ±0.00 | 50 | {50}
2500 (10000) | 100 | 2.00 | 0.52 | 0.0659 ±0.0004 | 0.0473 ±0.0037 | 20.00 ±0.00 | 125 | {125}
5000 (10000) | 100 | 2.32 | 0.61 | 0.0647 ±0.0002 | 0.0356 ±0.0029 | 20.00 ±0.00 | 250 | {250}
10000 (10000) | 100 | 4.75 | 0.69 | 0.0647 ±0.0002 | 0.0353 ±0.0021 | 20.00 ±0.00 | 500 | {500}


1000 (10000) | 100 | 3.89 | 0.60 | 0.0707 ±0.0011 | 0.0850 ±0.0065 | 40.00 ±0.00 | 25 | {25}
2500 (10000) | 100 | 4.05 | 0.81 | 0.0669 ±0.0005 | 0.0577 ±0.0039 | 40.00 ±0.00 | 62 | [62, 63]
5000 (10000) | 100 | 4.47 | 1.00 | 0.0650 ±0.0002 | 0.0388 ±0.0030 | 40.00 ±0.00 | 125 | {125}
10000 (10000) | 100 | 5.66 | 1.24 | 0.0649 ±0.0002 | 0.0379 ±0.0022 | 40.00 ±0.00 | 250 | {250}


1000 (10000) | 100 | 4.70 | 0.69 | 0.0726 ±0.0010 | 0.0960 ±0.0051 | 50.00 ±0.00 | 20 | {20}
2500 (10000) | 100 | 5.06 | 0.90 | 0.0675 ±0.0005 | 0.0627 ±0.0038 | 50.00 ±0.00 | 50 | {50}
5000 (10000) | 100 | 5.33 | 1.15 | 0.0651 ±0.0002 | 0.0410 ±0.0027 | 50.00 ±0.00 | 100 | {100}
10000 (10000) | 100 | 6.57 | 1.49 | 0.0649 ±0.0002 | 0.0383 ±0.0020 | 50.00 ±0.00 | 200 | {200}


Table 11: Experimental results relating to the artificial data of Type V



